QUALITY IMPROVEMENT TECHNIQUES IN AN AUDIO ENCODER

RELATED APPLICATION INFORMATION 

The following concurrently-filed U.S. patent applications relate to the present
application: U.S. Patent Application Serial No. aa/bbb,ccc, entitled "QUALITY AND
RATE CONTROL TECHNIQUES FOR DIGITAL AUDIO," filed December 14, 2001, the
disclosure of which is hereby incorporated by reference; U.S. Patent Application Serial
No. aa/bbb,ccc, entitled "TECHNIQUES FOR MEASUREMENT OF PERCEPTUAL
AUDIO QUALITY," filed December 14, 2001, the disclosure of which is hereby
incorporated by reference; U.S. Patent Application Serial No. aa/bbb,ccc, entitled
"QUANTIZATION MATRICES FOR DIGITAL AUDIO," filed December 14, 2001, the
disclosure of which is hereby incorporated by reference; and U.S. Patent Application
Serial No. aa/bbb,ccc, entitled "ADAPTIVE WINDOW-SIZE SELECTION IN
TRANSFORM CODING," filed December 14, 2001, the disclosure of which is hereby
incorporated by reference.

TECHNICAL FIELD 

The present invention relates to techniques for improving sound quality of an
audio codec (encoder/decoder).

BACKGROUND 

The digital transmission and storage of audio signals are increasingly based on
data reduction algorithms, which are adapted to the properties of the human auditory
system and particularly rely on masking effects. Such algorithms do not primarily aim
to minimize distortions; rather, they attempt to shape these distortions so that
they are perceived as little as possible.

To understand these audio encoding techniques, it helps to understand how
audio information is represented in a computer and how humans perceive audio.

I. Representation of Audio Information in a Computer

A computer processes audio information as a series of numbers representing 
the audio information. For example, a single number can represent an audio sample,
which is an amplitude (i.e., loudness) at a particular time. Several factors affect the 
quality of the audio information, including sample depth, sampling rate, and channel 
mode. 

Sample depth (or precision) indicates the range of numbers used to represent a 
sample. The more values possible for the sample, the higher the quality is because 
the number can capture more subtle variations in amplitude. For example, an 8-bit 
sample has 256 possible values, while a 16-bit sample has 65,536 possible values. 

The sampling rate (usually measured as the number of samples per second) 
also affects quality. The higher the sampling rate, the higher the quality because more 
frequencies of sound can be represented. Some common sampling rates are 8,000, 
11,025, 22,050, 32,000, 44,100, 48,000, and 96,000 samples/second. 

Mono and stereo are two common channel modes for audio. In mono mode, 
audio information is present in one channel. In stereo mode, audio information is 
present in two channels, usually labeled the left and right channels. Other modes with
more channels, such as 5-channel surround sound, are also possible. Table 1 shows
several formats of audio with different quality levels, along with corresponding raw bit
rate costs.



Quality              Sample Depth    Sampling Rate      Mode     Raw Bit Rate
                     (bits/sample)   (samples/second)            (bits/second)

Internet telephony   8               8,000              mono     64,000
telephone            8               11,025             mono     88,200
CD audio             16              44,100             stereo   1,411,200
high quality audio   16              48,000             stereo   1,536,000

Table 1: Bit rates for different quality audio information



As Table 1 shows, the cost of high quality audio information such as CD audio 
is high bit rate. High quality audio information consumes large amounts of computer 
storage and transmission capacity. 
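
The raw bit rates in Table 1 follow from multiplying sample depth, sampling rate, and number of channels. The short sketch below reproduces that arithmetic; the function name and the dictionary layout are illustrative choices and are not part of the original text.

```python
def raw_bit_rate(sample_depth_bits: int, sampling_rate_hz: int, channels: int) -> int:
    """Raw (uncompressed) bit rate in bits/second."""
    return sample_depth_bits * sampling_rate_hz * channels


# (sample depth, sampling rate, channel count) for the rows of Table 1;
# mono is treated as 1 channel and stereo as 2 channels.
FORMATS = {
    "Internet telephony": (8, 8_000, 1),
    "telephone": (8, 11_025, 1),
    "CD audio": (16, 44_100, 2),
    "high quality audio": (16, 48_000, 2),
}

for name, (depth, rate, channels) in FORMATS.items():
    print(f"{name}: {raw_bit_rate(depth, rate, channels):,} bits/second")
```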

Compression (also called encoding or coding) decreases the cost of storing and
transmitting audio information by converting the information into a lower bit rate form.
Compression can be lossless (in which quality does not suffer) or lossy (in which
quality suffers). Decompression (also called decoding) extracts a reconstructed
version of the original information from the compressed form.

Quantization is a conventional lossy compression technique. There are many
different kinds of quantization including uniform and non-uniform quantization, scalar
and vector quantization, and adaptive and non-adaptive quantization. Quantization
maps ranges of input values to single values. For example, with uniform, scalar
quantization by a factor of 3.0, a sample with a value anywhere between -1.5 and
1.499 is mapped to 0, a sample with a value anywhere between 1.5 and 4.499 is
mapped to 1, etc. To reconstruct the sample, the quantized value is multiplied by the
quantization factor, but the reconstruction is imprecise. Continuing the example started
above, the quantized value 1 reconstructs to 1 x 3 = 3; it is impossible to determine
where the original sample value was in the range 1.5 to 4.499. Quantization causes a
loss in fidelity of the reconstructed value compared to the original value. Quantization
can dramatically improve the effectiveness of subsequent lossless compression,
however, thereby reducing bit rate.
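
The uniform, scalar quantization example above can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and rounding half up is an assumption chosen so that the boundaries match the example (-1.5 maps to 0 and 1.5 maps to 1).

```python
import math


def quantize(value: float, factor: float) -> int:
    """Map a sample to a quantization level (round half up)."""
    return math.floor(value / factor + 0.5)


def reconstruct(level: int, factor: float) -> float:
    """Reconstruct an approximate sample value from its level."""
    return level * factor


for sample in (-1.5, 1.499, 1.5, 4.499):
    level = quantize(sample, 3.0)
    print(sample, "->", level, "->", reconstruct(level, 3.0))
```

Every input in the range 1.5 to 4.499 yields level 1 and reconstructs to 3.0, which is the loss of fidelity the paragraph describes.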

An audio encoder can use various techniques to provide the best possible
quality for a given bit rate, including transform coding, rate control, and modeling
human perception of audio. As a result of these techniques, an audio signal can be
more heavily quantized at selected frequencies or times to decrease bit rate, yet the
increased quantization will not significantly degrade perceived quality for a listener.

Transform coding techniques convert information into a form that makes it
easier to separate perceptually important information from perceptually unimportant
information. The less important information can then be quantized heavily, while the
more important information is preserved, so as to provide the best perceived quality for
a given bit rate. Transform coding techniques typically convert information into the
frequency (or spectral) domain. For example, a transform coder converts a time series
of audio samples into frequency coefficients. Transform coding techniques include
Discrete Cosine Transform ["DCT"], Modulated Lapped Transform ["MLT"], and Fast
Fourier Transform ["FFT"]. In practice, the input to a transform coder is partitioned into
blocks, and each block is transform coded. Blocks may have varying or fixed sizes,
and may or may not overlap with an adjacent block. After transform coding, a
frequency range of coefficients may be grouped for the purpose of quantization, in
which case each coefficient is quantized like the others in the group, and the frequency
range is called a quantization band. For more information about transform coding and
MLT in particular, see Gibson et al., Digital Compression for Multimedia, "Chapter 7:
Frequency Domain Coding," Morgan Kaufman Publishers, Inc., pp. 227-262 (1998);
U.S. Patent No. 6,115,689 to Malvar; H.S. Malvar, Signal Processing with Lapped
Transforms, Artech House, Norwood, MA, 1992; or Seymour Schlein, "The Modulated
Lapped Transform, Its Time-Varying Forms, and Its Application to Audio Coding
Standards," IEEE Transactions on Speech and Audio Processing, Vol. 5, No. 4, pp.
359-66, July 1997.

With rate control, an encoder adjusts quantization to regulate bit rate. For
audio information at a constant quality, complex information typically has a higher bit
rate (is less compressible) than simple information. So, if the complexity of audio
information changes in a signal, the bit rate may change. In addition, changes in
transmission capacity (such as those due to Internet traffic) affect available bit rate in
some applications. The encoder can decrease bit rate by increasing quantization, and
vice versa. Because the relation between degree of quantization and bit rate is
complex and hard to predict in advance, the encoder can try different degrees of
quantization to get the best quality possible for some bit rate, which is an example of a
quantization loop.
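
A quantization loop of this kind can be sketched as a simple search over step sizes. The sketch below is an assumption-laden illustration rather than any particular encoder's method: `estimate_bits` is a hypothetical cost model standing in for real entropy coding, and the starting step, growth factor, and iteration limit are arbitrary.

```python
import math


def quantize_block(coeffs, step):
    """Uniform scalar quantization of a block of coefficients."""
    return [math.floor(c / step + 0.5) for c in coeffs]


def estimate_bits(levels):
    """Hypothetical bit-cost model: larger magnitudes cost more bits."""
    return sum(1 + 2 * math.floor(math.log2(abs(q) + 1)) for q in levels)


def quantization_loop(coeffs, target_bits, step=1.0, growth=1.25, max_iters=64):
    """Coarsen quantization until the estimated bit cost fits the target."""
    for _ in range(max_iters):
        levels = quantize_block(coeffs, step)
        if estimate_bits(levels) <= target_bits:
            return step, levels
        step *= growth  # more quantization, fewer bits
    # Best effort if the target was never met within max_iters.
    return step, quantize_block(coeffs, step)


step, levels = quantization_loop([40.0, -12.5, 3.2, 0.4], target_bits=12)
print(step, levels, estimate_bits(levels))
```

The loop returns the finest step size it tried that meets the bit target, which corresponds to trying different degrees of quantization to get the best quality for a given bit rate.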

II. Human Perception of Audio Information

In addition to the factors that determine objective audio quality, perceived audio
quality also depends on how the human body processes audio information. For this
reason, audio processing tools often process audio information according to an
auditory model of human perception.

Typically, an auditory model considers the range of human hearing and critical
bands. Humans can hear sounds ranging from roughly 20 Hz to 20 kHz, and are most
sensitive to sounds in the 2 - 4 kHz range. The human nervous system integrates
sub-ranges of frequencies. For this reason, an auditory model may organize and
process audio information by critical bands. For example, one critical band scale
groups frequencies into 24 critical bands with upper cut-off frequencies (in Hz) at 100,
200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480, 1720, 2000, 2320, 2700, 3150,



SAW: 12/14/01 3382-61344 180529.1 Express Mail No. EL 874429730 US


3700, 4400, 5300, 6400, 7700, 9500, 12000, and 15500. Different auditory models use
a different number of critical bands (e.g., 25, 32, 55, or 109) and/or different cut-off 
frequencies for the critical bands. Bark bands are a well-known example of critical 
bands. 
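
As an illustration of the 24-band scale listed above, the following sketch maps a frequency to its critical band by searching the upper cut-off frequencies. The function name, the 0-based band index, and the choice to assign a frequency equal to a cut-off to the lower band are assumptions made for this example.

```python
from bisect import bisect_left

# Upper cut-off frequencies (Hz) of the 24 critical bands listed in the text.
UPPER_CUTOFFS_HZ = [
    100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480, 1720,
    2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700, 9500, 12000, 15500,
]


def critical_band(frequency_hz: float):
    """Return the 0-based critical band index, or None above the last cut-off."""
    if frequency_hz < 0:
        raise ValueError("frequency must be non-negative")
    index = bisect_left(UPPER_CUTOFFS_HZ, frequency_hz)
    return index if index < len(UPPER_CUTOFFS_HZ) else None


print(critical_band(1000))   # band whose upper cut-off is 1080 Hz
print(critical_band(16000))  # above the highest cut-off in this scale
```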

Aside from range and critical bands, interactions between audio signals can
dramatically affect perception. An audio signal that is clearly audible if presented alone
can be completely inaudible in the presence of another audio signal, called the masker
or the masking signal. The human ear is relatively insensitive to distortion or other loss
in fidelity (i.e., noise) in the masked signal, so the masked signal can include more
distortion without degrading perceived audio quality. Table 2 lists various factors and
how the factors relate to perception of an audio signal.



Factor                Relation to Perception of an Audio Signal

outer and middle      Generally, the outer and middle ear attenuate higher frequency
ear transfer          information and pass middle frequency information. Noise is less
                      audible in higher frequencies than middle frequencies.

noise in the          Noise present in the auditory nerve, together with noise from the
auditory nerve        flow of blood, increases for low frequency information. Noise is
                      less audible in lower frequencies than middle frequencies.

perceptual            Depending on the frequency of the audio signal, hair cells at
frequency scales      different positions in the inner ear react, which affects the pitch that
                      a human perceives. Critical bands relate frequency to pitch.

excitation            Hair cells typically respond several milliseconds after the onset of
                      the audio signal at a frequency. After exposure, hair cells and
                      neural processes need time to recover full sensitivity. Moreover,
                      loud signals are processed faster than quiet signals. Noise can be
                      masked when the ear will not sense it.

detection             Humans are better at detecting changes in loudness for quieter
                      signals than louder signals. Noise can be masked in quieter
                      signals.

simultaneous          For a masker and maskee present at the same time, the maskee is
masking               masked at the frequency of the masker but also at frequencies
                      above and below the masker. The amount of masking depends on
                      the masker and maskee structures and the masker frequency.

temporal              The masker has a masking effect before and after the masker
masking               itself. Generally, forward masking is more pronounced than
                      backward masking. The masking effect diminishes further away
                      from the masker in time.

loudness              Perceived loudness of a signal depends on frequency, duration,
                      and sound pressure level. The components of a signal partially
                      mask each other, and noise can be masked as a result.

cognitive             Cognitive effects influence perceptual audio quality. Abrupt
processing            changes in quality are objectionable. Different components of an
                      audio signal are important in different applications (e.g., speech vs.
                      music).

Table 2: Various factors that relate to perception of audio

An auditory model can consider any of the factors shown in Table 2 as well as
other factors relating to physical or neural aspects of human perception of sound. For
more information about auditory models, see:

1) Zwicker and Feldtkeller, "Das Ohr als Nachrichtenempfänger," Hirzel-Verlag,
Stuttgart, 1967;

2) Terhardt, "Calculating Virtual Pitch," Hearing Research, 1:155-182, 1979;

3) Lufti, "Additivity of Simultaneous Masking," Journal of Acoustical Society of
America, 73:262-267, 1983;

4) Jesteadt et al., "Forward Masking as a Function of Frequency, Masker Level,
and Signal Delay," Journal of Acoustical Society of America, 71:950-962, 1982;

5) ITU, Recommendation ITU-R BS 1387, Method for Objective Measurements of
Perceived Audio Quality, 1998;

6) Beerends, "Audio Quality Determination Based on Perceptual Measurement
Techniques," Applications of Digital Signal Processing to Audio and Acoustics, Chapter
1, Ed. Mark Kahrs, Karlheinz Brandenburg, Kluwer Acad. Publ., 1998; and

7) Zwicker, Psychoakustik, Springer-Verlag, Berlin Heidelberg, New York, 1982.

III. Measuring Audio Quality

In various applications, engineers measure audio quality. For example, quality
measurement can be used to evaluate the performance of different audio encoders or
other equipment, or the degradation introduced by a particular processing step. For
some applications, speed is emphasized over accuracy. For other applications, quality
is measured off-line and more rigorously.

Subjective listening tests are one way to measure audio quality. Different
people evaluate quality differently, however, and even the same person can be
inconsistent over time. By standardizing the evaluation procedure and quantifying the
results of evaluation, subjective listening tests can be made more consistent, reliable,
and reproducible. In many applications, however, quality must be measured quickly or
results must be very consistent over time, so subjective listening tests are
inappropriate.

Conventional measures of objective audio quality include signal to noise ratio
["SNR"] and distortion of the reconstructed audio signal compared to the original audio
signal. SNR is the ratio of the amplitude of the signal to the amplitude of the noise, and
is usually expressed in terms of decibels. Distortion D can be calculated as the
square of the differences between original values and reconstructed values.

$$ D = \left( u - q(u)\,Q \right)^{2} \qquad (1) $$

where u is an original value, q(u) is a quantized version of the original value, and Q
is a quantization factor. Both SNR and distortion are simple to calculate, but fail to
account for the audibility of noise. Namely, SNR and distortion fail to account for the
varying sensitivity of the human ear to noise at different frequencies and levels of
loudness, interaction with other sounds present in the signal (i.e., masking), or the
physical limitations of the human ear (i.e., the need to recover sensitivity). Both SNR
and distortion fail to accurately predict perceived audio quality in many cases.
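
The two conventional measures can be sketched as follows. This is a minimal illustration with hypothetical function names. The per-sample distortion squares the difference between an original value and its reconstruction, where the reconstruction is the quantized value multiplied by the quantization factor; the SNR uses the common energy-ratio convention in decibels, which is an assumption of this sketch.

```python
import math


def squared_error(original, reconstructed):
    """Per-sample distortion: squared difference of original and reconstruction."""
    return [(u - r) ** 2 for u, r in zip(original, reconstructed)]


def snr_db(original, reconstructed):
    """Signal to noise ratio in decibels (energy-ratio convention)."""
    signal_energy = sum(u * u for u in original)
    noise_energy = sum(squared_error(original, reconstructed))
    if noise_energy == 0:
        return math.inf  # perfect reconstruction
    return 10 * math.log10(signal_energy / noise_energy)


original = [3.0, -4.0]
reconstructed = [3.0, -3.0]
print(squared_error(original, reconstructed))
print(round(snr_db(original, reconstructed), 2))
```

Neither function uses any information about frequency, loudness, or masking, which is why these measures do not account for the audibility of noise.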

ITU-R BS 1387 is an international standard for objectively measuring perceived
audio quality. The standard describes several quality measurement techniques and
auditory models. The techniques measure the quality of a test audio signal compared
to a reference audio signal, in mono or stereo mode.

Figure 1 shows a masked threshold approach (100) to measuring audio quality
described in ITU-R BS 1387, Annex 1, Appendix 4, Sections 2, 3, and 4.2. In the
masked threshold approach (100), a first time to frequency mapper (110) maps a
reference signal (102) to frequency data, and a second time to frequency mapper (120)
maps a test signal (104) to frequency data. A subtracter (130) determines an error
signal from the difference between the reference signal frequency data and the test
signal frequency data. An auditory modeler (140) processes the reference signal
frequency data, including calculation of a masked threshold for the reference signal.
The error to threshold comparator (150) then compares the error signal to the masked
threshold, generating an audio quality estimate (152), for example, based upon the
differences in levels between the error signal and the masked threshold.

ITU-R BS 1387 describes in greater detail several other quality measures and
auditory models. In a FFT-based ear model, reference and test signals at 48 kHz are
each split into windows of 2048 samples such that there is 50% overlap across
consecutive windows. A Hann window function and FFT are applied, and the resulting
frequency coefficients are filtered to model the filtering effects of the outer and middle
ear. An error signal is calculated as the difference between the frequency coefficients
of the reference signal and those of the test signal. For each of the error signal, the
reference signal, and the test signal, the energy is calculated by squaring the signal
values. The energies are then mapped to critical bands/pitches. For each critical
band, the energies of the coefficients contributing to (e.g., within) that critical band are
added together. For the reference signal and the test signal, the energies for the
critical bands are then smeared across frequencies and time to model simultaneous
and temporal masking. The outputs of the smearing are called excitation patterns. A
masking threshold can then be calculated for an excitation pattern:

$$ M[k,n] = \frac{E[k,n]}{10^{\,m[k]/10}} \qquad (2) $$

for m[k] = 3.0 if k*res <= 12 and m[k] = k*res if k*res > 12, where k is the critical
band, res is the resolution of the band scale in terms of Bark bands, n is the frame,
and E[k,n] is the excitation pattern.

From the excitation patterns, error signal, and other outputs of the ear model,
ITU-R BS 1387 describes calculating Model Output Variables ["MOVs"]. One MOV is
the average noise to mask ratio ["NMR"] for a frame:

$$ \mathrm{NMR}[n] = 10 \log_{10} \left( \frac{1}{Z} \sum_{k=0}^{Z-1} \frac{P_{noise}[k,n]}{M[k,n]} \right) \qquad (3) $$

where n is the frame number, Z is the number of critical bands per frame, P_noise[k,n]
is the noise pattern, and M[k,n] is the masking threshold. NMR can also be
calculated for a whole signal as a combination of NMR values for frames.
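
Equations (2) and (3) can be sketched for a single frame as shown below. The function names are hypothetical, bands are indexed from 0, and the offset m[k] follows the description in this section as read here; ITU-R BS 1387 itself should be consulted for the normative values.

```python
import math


def masking_offset_db(k: int, res: float) -> float:
    """Offset m[k] in dB, per the description accompanying Equation (2)."""
    return 3.0 if k * res <= 12 else k * res


def masking_threshold(excitation, res):
    """Equation (2) for one frame: M[k] = E[k] / 10^(m[k]/10)."""
    return [e / 10 ** (masking_offset_db(k, res) / 10)
            for k, e in enumerate(excitation)]


def nmr_db(noise, threshold):
    """Equation (3) for one frame: average noise to mask ratio in dB."""
    z = len(threshold)
    return 10 * math.log10(sum(p / m for p, m in zip(noise, threshold)) / z)


excitation = [100.0, 100.0]
threshold = masking_threshold(excitation, res=0.25)
print(threshold)
print(nmr_db(threshold, threshold))  # noise exactly at the threshold
```

When the noise pattern equals the masking threshold in every band, the average ratio is 1 and the NMR is 0 dB; noise below the threshold gives a negative NMR.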

In ITU-R BS 1387, NMR and other MOVs are weighted and aggregated to give
a single output quality value. The weighting ensures that the single output value is
consistent with the results of subjective listening tests. For stereo signals, the linear
average of MOVs for the left and right channels is taken. For more information about
the FFT-based ear model and calculation of NMR and other MOVs, see ITU-R BS
1387, Annex 2, Sections 2.1 and 4-6. ITU-R BS 1387 also describes a filter bank-
based ear model. The Beerends reference also describes audio quality measurement,
as does Solari, Digital Video and Audio Compression, "Chapter 5: Sound and Audio,"
McGraw-Hill, Inc., pp. 187-212 (1997).

Compared to subjective listening tests, the techniques described in ITU-R BS 
1387 are more consistent and reproducible. Nonetheless, the techniques have several 
shortcomings. First, the techniques are complex and time-consuming, which limits 
their usefulness for real-time applications. For example, the techniques are too 
complex to be used effectively in a quantization loop in an audio encoder. Second, the 
NMR of ITU-R BS 1387 measures perceptible degradation compared to the masking 
threshold for the original signal, which can inaccurately estimate the perceptible 
degradation for a listener of the reconstructed signal. For example, the masking 
threshold of the original signal can be higher or lower than the masking threshold of the 
reconstructed signal due to the effects of quantization. A masking component in the 
original signal might not even be present in the reconstructed signal. Third, the NMR 
of ITU-R BS 1387 fails to adequately weight NMR on a per-band basis, which limits its
usefulness and adaptability. Aside from these shortcomings, the techniques described 
in ITU-R BS 1387 present several practical problems for an audio encoder. The 
techniques presuppose input at a fixed rate (48 kHz). The techniques assume fixed 
transform block sizes, and use a transform and window function (in the FFT-based ear 
model) that can be different than the transform used in the encoder, which is inefficient. 
Finally, the number of quantization bands used in the encoder is not necessarily equal 
to the number of critical bands in an auditory model of ITU-R BS 1387. 

Microsoft Corporation's Windows Media Audio version 7.0 ["WMA7"] partially
addresses some of the problems with implementing quality measurement in an audio
encoder. In WMA7, the encoder may jointly code the left and right channels of stereo
mode audio into a sum channel and a difference channel. The sum channel is the
averages of the left and right channels; the difference channel is the differences
between the left and right channels divided by two. The encoder calculates a noise
signal for each of the sum channel and the difference channel, where the noise signal
is the difference between the original channel and the reconstructed channel. The
encoder then calculates the maximum Noise to Excitation Ratio ["NER"] of all
quantization bands in the sum channel and difference channel:



$$ NER_{max\,all} = \max \left( \max_{d} \left( \frac{F_{Diff}[d]}{E_{Diff}[d]} \right), \max_{d} \left( \frac{F_{Sum}[d]}{E_{Sum}[d]} \right) \right) \qquad (4) $$

where d is the quantization band number, max_d is the maximum value across all d,
and E_Diff[d], E_Sum[d], F_Diff[d], and F_Sum[d] are the excitation pattern for the
difference channel, the excitation pattern for the sum channel, the noise pattern of the
difference channel, and the noise pattern of the sum channel, respectively, for
quantization bands. In WMA7, calculating an excitation or noise pattern includes
squaring values to determine energies, and then, for each quantization band, adding
the energies of the coefficients within that quantization band. If WMA7 does not use
jointly coded channels, the same equation is used to measure the quality of left and
right channels. That is,



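$$ NER_{max\,all} = \max \left( \max_{d} \left( \frac{F_{Left}[d]}{E_{Left}[d]} \right), \max_{d} \left( \frac{F_{Right}[d]}{E_{Right}[d]} \right) \right) \qquad (5) $$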
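
The sum and difference channels, the per-band patterns, and the maximum NER of Equations (4) and (5) can be sketched as follows. This is an illustration only, not WMA7 code: the function names and the representation of quantization bands as (start, end) index pairs are assumptions, and bands with zero excitation are skipped to avoid division by zero.

```python
def sum_difference(left, right):
    """Jointly code left/right into sum (average) and difference (half-difference)."""
    sum_ch = [(l + r) / 2 for l, r in zip(left, right)]
    diff_ch = [(l - r) / 2 for l, r in zip(left, right)]
    return sum_ch, diff_ch


def band_pattern(coeffs, band_edges):
    """Square values to get energies, then add the energies within each band.

    band_edges is a list of (start, end) coefficient index pairs, end exclusive.
    """
    return [sum(c * c for c in coeffs[a:b]) for a, b in band_edges]


def max_ner(noise_patterns, excitation_patterns):
    """Maximum noise to excitation ratio over all bands of all given channels."""
    return max(f / e
               for noise, excitation in zip(noise_patterns, excitation_patterns)
               for f, e in zip(noise, excitation) if e > 0)


sum_ch, diff_ch = sum_difference([4.0, 2.0], [2.0, 2.0])
print(sum_ch, diff_ch)
print(max_ner([[1.0, 2.0], [0.5, 4.0]], [[10.0, 10.0], [10.0, 8.0]]))
```

Passing the sum and difference channel patterns corresponds to Equation (4); passing the left and right channel patterns corresponds to Equation (5). The left and right channels can be recovered as sum plus difference and sum minus difference, respectively.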



WMA7 works in real time and measures audio quality for input with rates other
than 48 kHz. WMA7 uses a MLT with variable transform block sizes, and measures
audio quality using the same frequency coefficients used in compression. WMA7 does
not address several of the problems of ITU-R BS 1387, however, and WMA7 has
several other shortcomings as well, each of which decreases the accuracy of the
measurement of perceptual audio quality. First, although the quality measurement of
WMA7 is simple enough to be used in a quantization loop of the audio encoder, it does
not adequately correlate with actual human perception. As a result, changes in quality
in order to keep constant bit rate can be dramatic and perceptible. Second, the NER
of WMA7 measures perceptible degradation compared to the excitation pattern of the
original information (as opposed to reconstructed information), which can inaccurately
estimate perceptible degradation for a listener of the reconstructed signal. Third, the
NER of WMA7 fails to adequately weight NER on a per-band basis, which limits its
usefulness and adaptability. Fourth, although WMA7 works with variable-size
transform blocks, WMA7 is unable to perform operations such as temporal masking
between blocks due to the variable sizes. Fifth, WMA7 measures quality with respect
to excitation and noise patterns for quantization bands, which are not necessarily
related to a model of human perception with critical bands, and which can be different
in different variable-size blocks, preventing comparisons of results. Sixth, WMA7
measures the maximum NER for all quantization bands of a channel, which can
inappropriately ignore the contribution of NERs for other quantization bands. Seventh,
WMA7 applies the same quality measurement techniques whether independently or
jointly coded channels are used, which ignores differences between the two channel
modes.

Aside from WMA7, several international standards describe audio encoders
that incorporate an auditory model. The Moving Picture Experts Group, Audio Layer 3
["MP3"] and Moving Picture Experts Group 2, Advanced Audio Coding ["AAC"]
standards each describe techniques for measuring distortion in a reconstructed audio
signal against thresholds set with an auditory model.

In MP3, the encoder incorporates a psychoacoustic model to calculate Signal to
Mask Ratios ["SMRs"] for frequency ranges called threshold calculation partitions. In a
path separate from the rest of the encoder, the encoder processes the original audio
information according to the psychoacoustic model. The psychoacoustic model uses a
different frequency transform than the rest of the encoder (FFT vs. hybrid
polyphase/MDCT filter bank) and uses separate computations for energy and other
parameters. In the psychoacoustic model, the MP3 encoder processes blocks of
frequency coefficients according to the threshold calculation partitions, which have sub-
Bark band resolution (e.g., 62 partitions for a long block of 48 kHz input). The encoder
calculates a SMR for each partition. The encoder converts the SMRs for the partitions
into SMRs for scale factor bands. A scale factor band is a range of frequency
coefficients for which the encoder calculates a weight called a scale factor. The
number of scale factor bands depends on sampling rate and block size (e.g., 21 scale
factor bands for a long block of 48 kHz input). The encoder later converts the SMRs
for the scale factor bands into allowed distortion thresholds for the scale factor bands.

In an outer quantization loop, the MP3 encoder compares distortions for scale
factor bands to the allowed distortion thresholds for the scale factor bands. Each scale
factor starts with a minimum weight for a scale factor band. For the starting set of
scale factors, the encoder finds a satisfactory quantization step size in an inner
quantization loop. In the outer quantization loop, the encoder amplifies the scale
factors until the distortion in each scale factor band is less than the allowed distortion
threshold for that scale factor band, with the encoder repeating the inner quantization
loop for each adjusted set of scale factors. In special cases, the encoder exits the
outer quantization loop even if distortion exceeds the allowed distortion threshold for a
scale factor band (e.g., if all scale factors have been amplified or if a scale factor has
reached a maximum amplification).
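
The nested loops described above can be sketched in simplified form. This is a loose illustration under stated assumptions and is not the procedure of the MP3 standard: the bit-cost model, the amplification factor, the maximum scale, and all function names are hypothetical, and each scale factor band is represented as a plain list of coefficients.

```python
import math


def quantize(value, step):
    return math.floor(value / step + 0.5)


def bit_cost(levels):
    """Hypothetical bit-cost model, not the Huffman coding of the standard."""
    return sum(1 + 2 * math.floor(math.log2(abs(q) + 1))
               for band in levels for q in band)


def inner_loop(bands, scales, bit_budget, max_iters=200):
    """Grow a global step size until the scaled coefficients fit the budget."""
    step = 1.0
    for _ in range(max_iters):
        levels = [[quantize(c * s, step) for c in band]
                  for band, s in zip(bands, scales)]
        if bit_cost(levels) <= bit_budget:
            break
        step *= 1.1
    return step, levels


def band_distortion(band, levels, scale, step):
    """Squared error between a band and its reconstruction."""
    return sum((c - q * step / scale) ** 2 for c, q in zip(band, levels))


def outer_loop(bands, allowed, bit_budget, amplify=1.2, max_scale=8.0):
    """Amplify scale factors of bands whose distortion exceeds the threshold."""
    scales = [1.0] * len(bands)          # minimum weight for every band
    amplified = [False] * len(bands)
    while True:
        step, levels = inner_loop(bands, scales, bit_budget)
        over = [band_distortion(b, q, s, step) > a
                for b, q, s, a in zip(bands, levels, scales, allowed)]
        if not any(over):
            break                        # every band meets its threshold
        new_scales = [s * amplify if flag else s for s, flag in zip(scales, over)]
        amplified = [done or flag for done, flag in zip(amplified, over)]
        # Special-case exits: all bands amplified, or a maximum was reached.
        if all(amplified) or max(new_scales) > max_scale:
            break
        scales = new_scales
    return scales, step, levels


print(outer_loop([[10.0, -6.0], [0.8, 0.3]], allowed=[0.5, 0.01], bit_budget=20))
```

Amplifying a band's scale factor makes its coefficients larger relative to the shared step size, so that band is quantized more finely at the expense of bits available to the other bands.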

Before the quantization loops, the MP3 encoder can switch between long
blocks of 576 frequency coefficients and short blocks of 192 frequency coefficients
(sometimes called long windows or short windows). Instead of a long block, the
encoder can use three short blocks for better time resolution. The number of scale
factor bands is different for short blocks and long blocks (e.g., 12 scale factor bands vs.
21 scale factor bands). The MP3 encoder runs the psychoacoustic model twice (in
parallel, once for long blocks and once for short blocks) using different techniques to
calculate SMR depending on the block size.

The MP3 encoder can use any of several different coding channel modes,
including single channel, two independent channels (left and right channels), or two
jointly coded channels (sum and difference channels). If the encoder uses jointly
coded channels, the encoder computes a set of scale factors for each of the sum and
difference channels using the same techniques that are used for left and right
channels. Or, if the encoder uses jointly coded channels, the encoder can instead use
intensity stereo coding. Intensity stereo coding changes how scale factors are
determined for higher frequency scale factor bands and changes how sum and
difference channels are reconstructed, but the encoder still computes two sets of scale
factors for the two channels.

For additional information about MP3 and AAC, see the MP3 standard
("ISO/IEC 11172-3, Information Technology - Coding of Moving Pictures and
Associated Audio for Digital Storage Media at Up to About 1.5 Mbit/s - Part 3: Audio")
and the AAC standard.

Although MP3 encoding has achieved widespread adoption, it is unsuitable for
some applications (for example, real-time audio streaming at very low to mid bit rates)
for several reasons. First, calculating SMRs and allowed distortion thresholds with
MP3's psychoacoustic model occurs outside of the quantization loops. The
psychoacoustic model is too complex for some applications, and cannot be integrated
into a quantization loop for such applications. At the same time, as the psychoacoustic
model is outside of the quantization loops, it works with original audio information (as
opposed to reconstructed audio information), which can lead to inaccurate estimation
of perceptible degradation for a listener of the reconstructed signal at lower bit rates.
Second, the MP3 encoder fails to adequately weight SMRs and allowed distortion
thresholds on a per-band basis, which limits the usefulness and adaptability of the MP3
encoder. Third, computing SMRs and allowed distortion thresholds in separate tracks
for long blocks and short blocks prevents or complicates operations such as temporal
spreading or comparing measures for blocks of different sizes. Fourth, the MP3
encoder does not adequately exploit differences between independently coded
channels and jointly coded channels when calculating SMRs and allowed distortion
thresholds.

SUMMARY 

Embodiments of an audio encoder are described herein that digitally encode
audio signals with improved audio quality.

In a first audio encoding technique, an audio encoder dynamically selects 
between joint and independent coding of a multi-channel audio signal using an open- 
loop selection decision based upon (a) energy separation between the coding 
channels, and (b) the disparity between excitation patterns of the separate input 
channels. 

In a second audio encoding technique, an audio encoder performs band
truncation to suppress a few higher frequency transform coefficients, so as to permit
better coding of surviving coefficients. In one implementation, the audio encoder
determines a cut-off frequency as a function of a perceptual quality measure (e.g., a
noise-to-excitation ratio ("NER") of the input signal). This way, if the content being
compressed is not complex, less of such filtering is performed.

In a third audio encoding technique, an audio encoder performs channel re-
matrixing when jointly encoding a multi-channel audio signal. In one implementation,
the audio encoder suppresses certain coefficients of a difference channel by scaling
according to a scale factor, which is based on (a) current average levels of perceptual
quality, (b) current rate control buffer fullness, (c) coding mode (e.g., bit rate and
sample rate settings, etc.), and (d) the amount of channel separation in the source.
For example, if the current average perceptual quality measure indicates poor
reproduction, the scale factor is varied to cause severe suppression of the difference
channel in re-matrixing. Similar severe re-matrixing is performed as the rate control
buffer approaches fullness. Conversely, if the two channels of the input audio signal
significantly differ, the scale factor is varied so that little or no re-matrixing takes place.
In a fourth audio encoding technique, an audio encoder reduces the size of a 

1 5 quantization matrix in the encoded audio signal. The quantization matrix encodes 
quantizer step size of quantization bands of an encoded channel in the encoded audio 
signal. In one implementation, the quantization matrix is differentially encoded for 
successive frames of the audio signal. At certain (e.g., lower) coding rates, particular 
quantization bands may be quantized to all zeroes (e.g., due to quantization or band 

20 truncation). In such cases, the audio encoder reduces the bits needed to differentially 
encode the quantization matrices of successive frames by modifying the quantization 
step size of bands that are quantized to zero, so as to be differentially encoded using 
fewer bits. For example, the various bands that are quantized to zero may initially 
have various quantization step sizes. Via this technique, the audio encoder may adjust 

25 the quantization step sizes of these bands to be identical so that they may be 
differentially encoded in the quantization matrix using fewer bits. 
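
As a rough illustration of the fourth technique, the following Python sketch counts how many non-zero differences a differentially coded set of step sizes contains before and after the step sizes of zero-quantized bands are made identical. The function names, the rule of reusing the preceding band's step size, band-to-band differencing, and the use of a delta count as a proxy for bit cost are all assumptions made for illustration; they are not taken from the description.

```python
def equalize_zero_band_steps(step_sizes, band_is_zero):
    """Return a copy of step_sizes in which every zero-quantized band
    reuses the step size of the band before it (assumed rule)."""
    adjusted = list(step_sizes)
    for i in range(1, len(adjusted)):
        if band_is_zero[i]:
            adjusted[i] = adjusted[i - 1]
    return adjusted


def nonzero_deltas(step_sizes):
    """Count non-zero band-to-band differences, a crude stand-in for
    the bits spent on differential coding."""
    return sum(1 for a, b in zip(step_sizes, step_sizes[1:]) if a != b)


steps = [10, 12, 14, 20, 17, 23, 19]
zeroed = [False, False, False, True, True, True, True]
adjusted = equalize_zero_band_steps(steps, zeroed)
print(adjusted)                                          # [10, 12, 14, 14, 14, 14, 14]
print(nonzero_deltas(steps), nonzero_deltas(adjusted))   # 6 2
```

Because the affected bands carry only zeros, changing their step sizes does not alter the decoded coefficients in this sketch.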

BRIEF DESCRIPTION OF THE DRAWINGS 

Figure 1 is a diagram of a masked threshold approach to measuring audio
quality according to the prior art.

Figure 2 is a block diagram of a suitable computing environment for an audio
encoder incorporating quality enhancement techniques described herein.

Figures 3 and 4 are block diagrams of an audio encoder and decoder in which
quality enhancement techniques described herein are incorporated.

Figure 5 is a flow diagram of joint channel coding in the audio encoder of
Figure 3.

Figure 6 is a flow diagram of independent channel coding in the audio encoder 
of Figure 3. 

Figure 7 is a flow chart of a multi-channel coding decision process in the audio 
encoder of Figure 3. 

Figure 8 is a graph of cutoff frequency for band truncation as a function of a 
perceptual quality measure in the audio encoder of Figure 3. 

Figure 9 is a data flow diagram of a pre-encoding band truncation process 
based on a target quality measure in the audio encoder of Figure 3. 

Figure 10 is a data flow diagram of a multi-channel rematrixing process in the 
audio encoder of Figure 3. 

Figure 11 is a flow chart of a quantization step-size modification process for
header bit reduction in the audio encoder of Figure 3. 

Figure 12 is a graph of an example of quantization step-size modification to 
reduce header bits. 

Figure 13 is a chart showing a mapping of quantization bands to critical bands 
according to the illustrative embodiment. 

Figures 14a-14d are diagrams showing computation of NER in an audio 
encoder according to the illustrative embodiment. 

Figure 15 is a flowchart showing a technique for measuring the quality of a 
normalized block of audio information according to the illustrative embodiment. 

Figure 16 is a graph of an outer/middle ear transfer function according to the 
illustrative embodiment. 

Figure 17 is a flowchart showing a technique for computing an effective 
masking measure according to the illustrative embodiment. 

Figure 18 is a flowchart showing a technique for computing a band-weighted 
quality measure according to the illustrative embodiment. 

Figure 19 is a graph showing a set of perceptual weights for critical bands
according to the illustrative embodiment.

Figure 20 is a flowchart showing a technique for measuring audio quality in a
coding channel mode-dependent manner according to the illustrative embodiment.

DETAILED DESCRIPTION 

The following detailed description addresses embodiments of an audio encoder
that implements various audio quality improvements. The audio encoder incorporates
an improved multi-channel coding decision based on energy separation and excitation
pattern disparity between channels. The audio encoder further performs band
truncation at a cut-off frequency based on a perceptual quality measure. The audio
encoder also performs multi-channel rematrixing with suppression based on (a) current
average levels of perceptual quality, (b) current rate control buffer fullness, (c) coding
mode (e.g., bit rate and sample rate settings, etc.), and (d) the amount of channel
separation in the source. The audio encoder also adjusts step size of zero-quantized
quantization bands for efficient coding of the quantization matrix, such as in frame
headers.

I. Computing Environment 

Figure 2 illustrates a generalized example of a suitable computing environment
(200) in which the illustrative embodiment may be implemented. The computing
environment (200) is not intended to suggest any limitation as to scope of use or
functionality of the invention, as the present invention may be implemented in diverse
general-purpose or special-purpose computing environments.

With reference to Figure 2, the computing environment (200) includes at least
one processing unit (210) and memory (220). In Figure 2, this most basic configuration
(230) is included within a dashed line. The processing unit (210) executes
computer-executable instructions and may be a real or a virtual processor. In a multi-processing
system, multiple processing units execute computer-executable instructions to increase
processing power. The memory (220) may be volatile memory (e.g., registers, cache,
RAM), non-volatile memory (e.g., ROM, EEPROM, flash memory, etc.), or some
combination of the two. The memory (220) stores software (280) implementing an
audio encoder.

A computing environment may have additional features. For example, the
computing environment (200) includes storage (240), one or more input devices (250),
one or more output devices (260), and one or more communication connections (270).
An interconnection mechanism (not shown) such as a bus, controller, or network
interconnects the components of the computing environment (200). Typically,
operating system software (not shown) provides an operating environment for other
software executing in the computing environment (200), and coordinates activities of
the components of the computing environment (200).

The storage (240) may be removable or non-removable, and includes magnetic
disks, magnetic tapes or cassettes, CD-ROMs, CD-RWs, DVDs, or any other medium
which can be used to store information and which can be accessed within the
computing environment (200). The storage (240) stores instructions for the software
(280) implementing the audio encoder.

The input device(s) (250) may be a touch input device such as a keyboard,
mouse, pen, or trackball, a voice input device, a scanning device, or another device
that provides input to the computing environment (200). For audio, the input device(s)
(250) may be a sound card or similar device that accepts audio input in analog or
digital form. The output device(s) (260) may be a display, printer, speaker, or another
device that provides output from the computing environment (200).

The communication connection(s) (270) enable communication over a
communication medium to another computing entity. The communication medium
conveys information such as computer-executable instructions, compressed audio or
video information, or other data in a modulated data signal. A modulated data signal is
a signal that has one or more of its characteristics set or changed in such a manner as
to encode information in the signal. By way of example, and not limitation,
communication media include wired or wireless techniques implemented with an
electrical, optical, RF, infrared, acoustic, or other carrier.

The invention can be described in the general context of computer-readable
media. Computer-readable media are any available media that can be accessed within
a computing environment. By way of example, and not limitation, with the computing
environment (200), computer-readable media include memory (220), storage (240),
communication media, and combinations of any of the above.

The invention can be described in the general context of computer-executable
instructions, such as those included in program modules, being executed in a 
computing environment on a target real or virtual processor. Generally, program 
modules include routines, programs, libraries, objects, classes, components, data 
structures, etc. that perform particular tasks or implement particular abstract data 
types. The functionality of the program modules may be combined or split between 
program modules as desired in various embodiments. Computer-executable 
instructions for program modules may be executed within a local or distributed 
computing environment. 

For the sake of presentation, the detailed description uses terms like 
"determine," "get," "adjust," and "apply" to describe computer operations in a computing 
environment. These terms are high-level abstractions for operations performed by a 
computer, and should not be confused with acts performed by a human being. The 
actual computer operations corresponding to these terms vary depending on
implementation. 

II. Generalized Audio Encoder and Decoder 

Figure 3 is a block diagram of a generalized audio encoder (300). The 
relationships shown between modules within the encoder and decoder indicate the 
main flow of information in the encoder and decoder; other relationships are not shown 
for the sake of simplicity. Depending on implementation and the type of compression 
desired, modules of the encoder or decoder can be added, omitted, split into multiple 
modules, combined with other modules, and/or replaced with like modules. In 
alternative embodiments, encoders or decoders with different modules and/or other 
configurations of modules measure perceptual audio quality. 

A. Generalized Audio Encoder 

The generalized audio encoder (300) includes a frequency transformer (310), a
multi-channel transformer (320), a perception modeler (330), a weighter (340), a
quantizer (350), an entropy encoder (360), a rate/quality controller (370), and a
bitstream multiplexer ["MUX"] (380).

The encoder (300) receives a time series of input audio samples (305) in a
format such as one shown in Table 1. For input with multiple channels (e.g., stereo
mode), the encoder (300) processes channels independently, and can work with jointly
coded channels following the multi-channel transformer (320). The encoder (300)
compresses the audio samples (305) and multiplexes information produced by the
various modules of the encoder (300) to output a bitstream (395) in a format such as
Windows Media Audio ["WMA"] or Advanced Streaming Format ["ASF"]. Alternatively,
the encoder (300) works with other input and/or output formats.

The frequency transformer (310) receives the audio samples (305) and
converts them into data in the frequency domain. The frequency transformer (310)
splits the audio samples (305) into blocks, which can have variable size to allow
variable temporal resolution. Small blocks allow for greater preservation of time detail
at short but active transition segments in the input audio samples (305), but sacrifice
some frequency resolution. In contrast, large blocks have better frequency resolution
and worse time resolution, and usually allow for greater compression efficiency at
longer and less active segments. Blocks can overlap to reduce perceptible
discontinuities between blocks that could otherwise be introduced by later quantization.
The frequency transformer (310) outputs blocks of frequency coefficient data to the
multi-channel transformer (320) and outputs side information such as block sizes to the
MUX (380). The frequency transformer (310) outputs both the frequency coefficient
data and the side information to the perception modeler (330).

The frequency transformer (310) partitions a frame of audio input samples (305)
into overlapping sub-frame blocks with time-varying size and applies a time-varying
MLT to the sub-frame blocks. Possible sub-frame sizes include 128, 256, 512, 1024,
2048, and 4096 samples. The MLT operates like a DCT modulated by a time window
function, where the window function is time varying and depends on the sequence of
sub-frame sizes. The MLT transforms a given overlapping block of samples
$x[n],\ 0 \le n < \mathit{subframe\_size}$ into a block of frequency coefficients
$X[k],\ 0 \le k < \mathit{subframe\_size}/2$. The frequency transformer (310) can also output
estimates of the complexity of future frames to the rate/quality controller (370).
Alternative embodiments use other varieties of MLT. In still other alternative
embodiments, the frequency transformer (310) applies a DCT, FFT, or other type of
modulated or non-modulated, overlapped or non-overlapped frequency transform, or
uses subband or wavelet coding.
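
As an illustration of the block-to-coefficient mapping, the following Python sketch transforms one block of N samples into N/2 coefficients using a sine-windowed MDCT, which is one common form of the MLT. The fixed sine window, the scaling constant, and the name `mlt_block` are assumptions of this sketch; the frequency transformer described here uses a time-varying window that depends on the sequence of sub-frame sizes, which is not modeled.

```python
import math


def mlt_block(x):
    """Map one overlapping block of N samples to N/2 coefficients
    with a sine-windowed MDCT (direct O(N^2) evaluation)."""
    n_samples = len(x)
    m = n_samples // 2  # number of output coefficients
    # Fixed sine window (assumption; the encoder's window is time varying).
    window = [math.sin(math.pi * (n + 0.5) / n_samples)
              for n in range(n_samples)]
    scale = math.sqrt(2.0 / m)
    coeffs = []
    for k in range(m):
        acc = 0.0
        for n in range(n_samples):
            acc += window[n] * x[n] * math.cos(
                math.pi / m * (n + 0.5 + m / 2.0) * (k + 0.5))
        coeffs.append(scale * acc)
    return coeffs


block = [math.sin(2.0 * math.pi * 5.0 * n / 128.0) for n in range(128)]
print(len(mlt_block(block)))  # 64
```

A practical encoder would use a fast algorithm rather than this direct double loop; the sketch only shows the input and output sizes and the general shape of the transform.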

For multi-channel audio data, the multiple channels of frequency coefficient 
data produced by the frequency transformer (310) often correlate. To exploit this 
correlation, the multi-channel transformer (320) can convert the multiple original,
independently coded channels into jointly coded channels. For example, if the input is 
stereo mode, the multi-channel transformer (320) can convert the left and right 
channels into sum and difference channels: 

Or, the multi-channel transformer (320) can pass the left and right channels
through as independently coded channels. More generally, for a number of input
channels greater than one, the multi-channel transformer (320) passes original,
independently coded channels through unchanged or converts the original channels
into jointly coded channels. The decision to use independently or jointly coded
channels can be predetermined, or the decision can be made adaptively on a block by
block or other basis during encoding. The multi-channel transformer (320) produces
side information to the MUX (380) indicating the channel mode used.
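
A minimal Python sketch of a stereo sum/difference conversion and its inverse is shown here. The 1/2 normalization and the function names are assumptions made for illustration, not a statement of the encoder's exact equations.

```python
def to_sum_diff(left, right):
    """Convert left/right coefficients to sum and difference channels.
    The 1/2 scaling is an assumption of this sketch."""
    total = [(l + r) / 2.0 for l, r in zip(left, right)]
    diff = [(l - r) / 2.0 for l, r in zip(left, right)]
    return total, diff


def from_sum_diff(total, diff):
    """Invert to_sum_diff, recovering the left and right channels."""
    left = [s + d for s, d in zip(total, diff)]
    right = [s - d for s, d in zip(total, diff)]
    return left, right


s, d = to_sum_diff([1.0, 4.0], [3.0, 4.0])
print(s, d)                 # [2.0, 4.0] [-1.0, 0.0]
print(from_sum_diff(s, d))  # ([1.0, 4.0], [3.0, 4.0])
```

When the two input channels are similar, the difference channel is close to zero, which is the correlation that joint coding exploits.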

The perception modeler (330) models properties of the human auditory system 
to improve the quality of the reconstructed audio signal for a given bit rate. The 
perception modeler (330) computes the excitation pattern of a variable-size block of 
frequency coefficients. First, the perception modeler (330) normalizes the size and 
amplitude scale of the block. This enables subsequent temporal smearing and 
establishes a consistent scale for quality measures. Optionally, the perception modeler 
(330) attenuates the coefficients at certain frequencies to model the outer/middle ear 
transfer function. The perception modeler (330) computes the energy of the 
coefficients in the block and aggregates the energies by 25 critical bands. 
Alternatively, the perception modeler (330) uses another number of critical bands (e.g., 
55 or 109). The frequency ranges for the critical bands are implementation-dependent,
and numerous options are well known. For example, see ITU-R BS 1387 or a
reference mentioned therein. The perception modeler (330) processes the band
energies to account for simultaneous and temporal masking. In alternative
embodiments, the perception modeler (330) processes the audio data according to a
different auditory model, such as one described or mentioned in ITU-R BS 1387.
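
The energy aggregation step can be sketched in Python as follows. The band edge list and the function name are hypothetical; actual critical band boundaries are implementation-dependent, and the normalization, ear transfer function, and masking steps are omitted.

```python
def band_energies(coeffs, band_edges):
    """Sum squared coefficients within each band.

    band_edges[i] and band_edges[i + 1] delimit band i as a half-open
    index range (hypothetical layout chosen for this sketch).
    """
    return [sum(c * c for c in coeffs[lo:hi])
            for lo, hi in zip(band_edges, band_edges[1:])]


# Four coefficients grouped into two bands: [0, 1) and [1, 4).
print(band_energies([1.0, 2.0, 3.0, 4.0], [0, 1, 4]))  # [1.0, 29.0]
```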

The weighter (340) generates weighting factors (alternatively called a
quantization matrix) based upon the excitation pattern received from the perception
modeler (330) and applies the weighting factors to the data received from the
multi-channel transformer (320). The weighting factors include a weight for each of multiple
quantization bands in the audio data. The quantization bands can be the same or
different in number or position from the critical bands used elsewhere in the encoder
(300). The weighting factors indicate proportions at which noise is spread across the
quantization bands, with the goal of minimizing the audibility of the noise by putting
more noise in bands where it is less audible, and vice versa. The weighting factors can
vary in amplitudes and number of quantization bands from block to block. In one
implementation, the number of quantization bands varies according to block size;
smaller blocks have fewer quantization bands than larger blocks. For example, blocks
with 128 coefficients have 13 quantization bands, blocks with 256 coefficients have 15
quantization bands, up to 25 quantization bands for blocks with 2048 coefficients. The
weighter (340) generates a set of weighting factors for each channel of multi-channel
audio data in independently coded channels, or generates a single set of weighting
factors for jointly coded channels. In alternative embodiments, the weighter (340)
generates the weighting factors from information other than or in addition to excitation
patterns.

The weighter (340) outputs weighted blocks of coefficient data to the quantizer
(350) and outputs side information such as the set of weighting factors to the MUX
(380). The weighter (340) can also output the weighting factors to the rate/quality
controller (370) or other modules in the encoder (300). The set of weighting factors
can be compressed for more efficient representation. If the weighting factors are lossy
compressed, the reconstructed weighting factors are typically used to weight the blocks
of coefficient data. If audio information in a band of a block is completely eliminated for
some reason (e.g., noise substitution or band truncation), the encoder (300) may be
able to further improve the compression of the quantization matrix for the block.

The quantizer (350) quantizes the output of the weighter (340), producing
quantized coefficient data to the entropy encoder (360) and side information including
quantization step size to the MUX (380). Quantization introduces irreversible loss of
information, but also allows the encoder (300) to regulate the bit rate of the output
bitstream (395) in conjunction with the rate/quality controller (370). In Figure 3, the
quantizer (350) is an adaptive, uniform scalar quantizer. The quantizer (350) applies
the same quantization step size to each frequency coefficient, but the quantization step
size itself can change from one iteration to the next to affect the bit rate of the entropy
encoder (360) output. In alternative embodiments, the quantizer is a non-uniform
quantizer, a vector quantizer, and/or a non-adaptive quantizer.
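
A uniform scalar quantizer of the kind described can be sketched in Python as follows. The function names and the round-to-nearest rule are assumptions of this sketch; the point is only that a single step size is applied to every coefficient and that a larger step yields smaller integer levels at the cost of larger reconstruction error.

```python
def quantize(coeffs, step):
    """Map each coefficient to the nearest integer multiple of step."""
    return [int(round(c / step)) for c in coeffs]


def dequantize(levels, step):
    """Reconstruct approximate coefficients from integer levels."""
    return [q * step for q in levels]


weighted = [0.9, -2.2, 5.1]
print(quantize(weighted, 2.0))                   # [0, -1, 3]
print(dequantize(quantize(weighted, 2.0), 2.0))  # [0.0, -2.0, 6.0]
print(quantize(weighted, 4.0))                   # [0, -1, 1]
```

The reconstruction error of each coefficient is at most half the step size, which is why adjusting the step size trades quality against bit rate.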

The entropy encoder (360) losslessly compresses quantized coefficient data
received from the quantizer (350). For example, the entropy encoder (360) uses
multi-level run length coding, variable-to-variable length coding, run length coding, Huffman
coding, dictionary coding, arithmetic coding, LZ coding, a combination of the above, or
some other entropy encoding technique.
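
Of the listed options, plain run length coding is the simplest to illustrate. The following Python sketch encodes quantized levels as (zero run, level) pairs and decodes them losslessly. The pair layout, the handling of trailing zeros, and the function names are assumptions of this sketch and do not describe the encoder's actual bitstream syntax.

```python
def run_level_encode(levels):
    """Encode integers as (zero_run, nonzero_level) pairs.
    A trailing run of zeros is emitted as (run, 0)."""
    pairs = []
    run = 0
    for value in levels:
        if value == 0:
            run += 1
        else:
            pairs.append((run, value))
            run = 0
    if run:
        pairs.append((run, 0))
    return pairs


def run_level_decode(pairs):
    """Invert run_level_encode."""
    out = []
    for run, value in pairs:
        out.extend([0] * run)
        if value != 0:
            out.append(value)
    return out


levels = [0, 0, 3, 0, -1, 0, 0, 0]
pairs = run_level_encode(levels)
print(pairs)                              # [(2, 3), (1, -1), (3, 0)]
print(run_level_decode(pairs) == levels)  # True
```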

The rate/quality controller (370) works with the quantizer (350) to regulate the
bit rate and quality of the output of the encoder (300). The rate/quality controller (370)
receives information from other modules of the encoder (300). In one implementation,
the rate/quality controller (370) receives estimates of future complexity from the
frequency transformer (310), sampling rate, block size information, the excitation
pattern of original audio data from the perception modeler (330), weighting factors from
the weighter (340), a block of quantized audio information in some form (e.g.,
quantized, reconstructed, or encoded), and buffer status information from the MUX
(380). The rate/quality controller (370) can include an inverse quantizer, an inverse
weighter, an inverse multi-channel transformer, and, potentially, an entropy decoder
and other modules, to reconstruct the audio data from a quantized form.

The rate/quality controller (370) processes the information to determine a
desired quantization step size given current conditions and outputs the quantization
step size to the quantizer (350). The rate/quality controller (370) then measures the
quality of a block of reconstructed audio data as quantized with the quantization step
size, as described below. Using the measured quality as well as bit rate information,
the rate/quality controller (370) adjusts the quantization step size with the goal of
satisfying bit rate and quality constraints, both instantaneous and long-term. In
alternative embodiments, the rate/quality controller (370) works with different or
additional information, or applies different techniques to regulate quality and bit rate.

In conjunction with the rate/quality controller (370), the encoder (300) can apply
noise substitution, band truncation, and/or multi-channel rematrixing to a block of audio
data. At low and mid-bit rates, the audio encoder (300) can use noise substitution to
convey information in certain bands. In band truncation, if the measured quality for a
block indicates poor quality, the encoder (300) can completely eliminate the
coefficients in certain (usually higher frequency) bands to improve the overall quality in
the remaining bands. In multi-channel rematrixing, for low bit rate, multi-channel audio
data in jointly coded channels, the encoder (300) can suppress information in certain
channels (e.g., the difference channel) to improve the quality of the remaining
channel(s) (e.g., the sum channel).
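
Band truncation and difference-channel suppression can be sketched in Python as simple operations on coefficient lists. The function names, the index-based cutoff, and the single scalar scale factor are assumptions made for illustration; how the cutoff and the scale factor are chosen is not modeled here.

```python
def truncate_bands(coeffs, cutoff_index):
    """Zero every coefficient at or above cutoff_index."""
    return [c if i < cutoff_index else 0.0 for i, c in enumerate(coeffs)]


def suppress_difference(diff_coeffs, scale):
    """Scale the difference channel: 1.0 leaves it intact,
    0.0 removes it entirely (hypothetical scale convention)."""
    return [scale * c for c in diff_coeffs]


print(truncate_bands([4.0, 3.0, 2.0, 1.0], 2))   # [4.0, 3.0, 0.0, 0.0]
print(suppress_difference([2.0, -4.0], 0.5))     # [1.0, -2.0]
```

In both cases the bits saved on the suppressed coefficients become available for the coefficients that remain.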

The MUX (380) multiplexes the side information received from the other
modules of the audio encoder (300) along with the entropy encoded data received from
the entropy encoder (360). The MUX (380) outputs the information in WMA or in
another format that an audio decoder recognizes.

The MUX (380) includes a virtual buffer that stores the bitstream (395) to be
output by the encoder (300). The virtual buffer stores a pre-determined duration of
audio information (e.g., 5 seconds for streaming audio) in order to smooth over
short-term fluctuations in bit rate due to complexity changes in the audio. The virtual buffer
then outputs data at a relatively constant bit rate. The current fullness of the buffer, the
rate of change of fullness of the buffer, and other characteristics of the buffer can be
used by the rate/quality controller (370) to regulate quality and bit rate.
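
The bookkeeping of such a virtual buffer can be sketched in Python as follows. The function name, the parameters, and the clamp at zero are assumptions of this sketch; the actual buffer model and its capacity handling may differ.

```python
def update_fullness(fullness_bits, block_bits, block_seconds, bit_rate):
    """Add the bits produced for one block, then drain the buffer at
    the constant output bit rate for the block's duration."""
    drained = bit_rate * block_seconds
    return max(0.0, fullness_bits + block_bits - drained)


fullness = 10000.0
# A complex block that costs more bits than the channel drains.
fullness = update_fullness(fullness, 40000, 0.5, 64000)
print(fullness)  # 18000.0
# A simple block that costs fewer bits lets the buffer empty out.
fullness = update_fullness(fullness, 20000, 0.5, 64000)
print(fullness)  # 6000.0
```

A controller could compare the returned fullness and its change between blocks against the buffer capacity when choosing the next quantization step size.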

B. Generalized Audio Decoder 

With reference to Figure 4, the generalized audio decoder (400) includes a
bitstream demultiplexer ["DEMUX"] (410), an entropy decoder (420), an inverse
quantizer (430), a noise generator (440), an inverse weighter (450), an inverse
multi-channel transformer (460), and an inverse frequency transformer (470). The decoder
(400) is simpler than the encoder (300) because the decoder (400) does not include
modules for rate/quality control.

The decoder (400) receives a bitstream (405) of compressed audio data in 
WMA or another format. The bitstream (405) includes entropy encoded data as well as 
side information from which the decoder (400) reconstructs audio samples (495). For 
audio data with multiple channels, the decoder (400) processes each channel 
independently, and can work with jointly coded channels before the inverse multi- 
channel transformer (460). 

The DEMUX (410) parses information in the bitstream (405) and sends
information to the modules of the decoder (400). The DEMUX (410) includes one or 
more buffers to compensate for short-term variations in bit rate due to fluctuations in 
complexity of the audio, network jitter, and/or other factors. 

The entropy decoder (420) losslessly decompresses entropy codes received 
from the DEMUX (410), producing quantized frequency coefficient data. The entropy
decoder (420) typically applies the inverse of the entropy encoding technique used in 
the encoder. 

The inverse quantizer (430) receives a quantization step size from the DEMUX 
(410) and receives quantized frequency coefficient data from the entropy decoder 
(420). The inverse quantizer (430) applies the quantization step size to the quantized 
frequency coefficient data to partially reconstruct the frequency coefficient data. In
alternative embodiments, the inverse quantizer applies the inverse of some other
quantization technique used in the encoder.

The noise generator (440) receives from the DEMUX (410) an indication of which
bands in a block of data are noise substituted as well as any parameters for the form of
the noise. The noise generator (440) generates the patterns for the indicated bands,
and passes the information to the inverse weighter (450).

The inverse weighter (450) receives the weighting factors from the DEMUX
(410), patterns for any noise-substituted bands from the noise generator (440), and the 
partially reconstructed frequency coefficient data from the inverse quantizer (430). As 
necessary, the inverse weighter (450) decompresses the weighting factors. The 
inverse weighter (450) applies the weighting factors to the partially reconstructed 
frequency coefficient data for bands that have not been noise substituted. The inverse
weighter (450) then adds in the noise patterns received from the noise generator (440).

The inverse multi-channel transformer (460) receives the reconstructed 
frequency coefficient data from the inverse weighter (450) and channel mode 
information from the DEMUX (410). If multi-channel data is in independently coded 
channels, the inverse multi-channel transformer (460) passes the channels through. If
multi-channel data is in jointly coded channels, the inverse multi-channel transformer 
(460) converts the data into independently coded channels. If desired, the decoder 
(400) can measure the quality of the reconstructed frequency coefficient data at this 
point. 

The inverse frequency transformer (470) receives the frequency coefficient data
output by the inverse multi-channel transformer (460) as well as side information such as block
sizes from the DEMUX (410). The inverse frequency transformer (470) applies the
inverse of the frequency transform used in the encoder and outputs blocks of
reconstructed audio samples (495).

III. Multi-Channel Coding Decision

As described above, the audio encoder 300 (Figure 3) can dynamically decide
between encoding a multiple channel input audio signal in a joint channel coding mode
or an independent channel coding mode, such as on a block-by-block or other basis,
for improved compression efficiency. In joint channel coding 500 (Figure 5), the audio
encoder applies a multi-channel transformation 510 on multiple channels of the input
signal to produce coding channels, which are then transform encoded (e.g., via
frequency transform, quantization, and entropy encoding processes described above).
An example of a multi-channel transformation is the conversion of left and right stereo
channels into sum and difference channels using the equations (1) and (2) given
above. In alternative embodiments, the joint coding can be performed on other
multiple channel input signals, such as 5.1 channel surround sound, etc. Various
alternative multi-channel transformations can be used to combine input channel signals
into coding channels for the joint channel coding of such other multiple channel signals.

By contrast, the audio encoder 300 separately transform encodes the individual
channels of a multiple channel input signal in independent channel coding 600 (Figure
6).

Figure 7 shows one implementation of a multi-channel coding decision process
700 performed in the audio encoder 300 (Figure 3) to decide the channel coding mode
(joint channel coding 500 or independent channel coding 600). In this implementation,
the multi-channel coding decision process 700 is an open-loop decision, which
generally is less computationally expensive. In this open-loop decision process 700,
the decision between channel coding modes is made based on: (a) energy separation
between the coding channels, and (b) the disparity between excitation patterns of the
individual input channels. This latter basis (excitation pattern disparity) for the
multi-channel coding decision is beneficial in audio encoders in which the quantization
matrices are forced to be the same for both coding channels when performing joint
channel coding. If the aggregate excitation pattern used in generating the quantization
matrix is severely mismatched with the excitation patterns of either of the coding
channels, then the joint channel coding 500 in such audio encoders would produce a
severe coding efficiency penalty. The excitation pattern of the audio signal is
discussed in the section below, entitled, "Measuring Audio Quality."

In the illustrated process 700, the audio encoder 300 decides the channel 
coding mode on a block basis. In other words, the process 700 is performed per input 
signal block as indicated at decision 770. Alternatively, the channel coding decision 
can be made on other bases. 

At a first action 710 in the process 700, the audio encoder 300 measures the
energy separation between the coding channels with and without the multi-channel
transformation 510. At decision 720, the audio encoder 300 then determines whether
the energy separation of the coding channels with the multi-channel transformation is
greater than that without the transformation. In the case of two stereo channels (left
and right), the audio encoder can determine the energy separation is greater with the
transformation if the following relation evaluates to true:

$$\frac{\max(\sigma_l, \sigma_r)}{\min(\sigma_l, \sigma_r)} < \frac{\max(\sigma_s, \sigma_d)}{\min(\sigma_s, \sigma_d)}$$

where $\sigma_l$, $\sigma_r$, $\sigma_s$, and $\sigma_d$ refer to standard deviation in left, right, sum and difference
channels, respectively, in either the time or frequency (transform) domain. If either
denominator is zero, that corresponding ratio is taken to be a large value, e.g. infinity.

If the energy separation is greater with the multi-channel transformation at
decision 720, the audio encoder 300 proceeds to also measure the disparity between
excitation patterns of the individual input channels at action 730. In one
implementation, the disparity in excitation patterns between the input channels is
measured using the following calculation:

$$\max_{b} \left[ \frac{E[b]\ \text{of left channel}}{E[b]\ \text{of right channel}},\ \frac{E[b]\ \text{of right channel}}{E[b]\ \text{of left channel}} \right]$$

where E[b] refers to the excitation pattern computed for critical band b.
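
The two measurements above can be sketched in Python as follows. The function names are hypothetical, the population standard deviation from the `statistics` module is one possible choice of standard deviation, and the excitation patterns are assumed to be strictly positive. The sum and difference channels are left unscaled because a common scale factor cancels in the ratio.

```python
import statistics


def energy_separation(sd_a, sd_b):
    """Ratio of the larger to the smaller standard deviation;
    a zero denominator is treated as infinity."""
    larger, smaller = max(sd_a, sd_b), min(sd_a, sd_b)
    return float("inf") if smaller == 0 else larger / smaller


def transform_increases_separation(left, right):
    """True if sum/difference channels are more separated in energy
    than the left/right channels (the relation tested at decision 720)."""
    total = [l + r for l, r in zip(left, right)]
    diff = [l - r for l, r in zip(left, right)]
    sd = statistics.pstdev
    return (energy_separation(sd(left), sd(right))
            < energy_separation(sd(total), sd(diff)))


def excitation_disparity(e_left, e_right):
    """Largest per-band ratio between the two excitation patterns,
    taken in whichever direction is greater (assumes positive values)."""
    return max(max(l / r, r / l) for l, r in zip(e_left, e_right))


same = [1.0, -1.0, 2.0, -2.0]
print(transform_increases_separation(same, same))         # True
print(transform_increases_separation(same, [0.0] * 4))    # False
print(excitation_disparity([1.0, 4.0, 2.0], [2.0, 1.0, 2.0]))  # 4.0
```

Identical channels produce an all-zero difference channel, so the transformed separation is infinite; a silent right channel makes the untransformed separation infinite instead, so the transformation cannot improve on it.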

In a second implementation, the audio encoder 300 uses a ratio between the
expected noise-to-excitation ratio (NER) of the two input channels as a measure of the
disparity. The measurement of NER is discussed in more detail below in the section
entitled, "Measuring Audio Quality." For joint coding mode, for a given channel c, the
expected NER is given as:

$$\mathrm{NER}_c = \sum_{b} W[b]\,\cdots \qquad (10)$$

where $\bar{E}[b]$ is the aggregate excitation pattern of the input channels at critical band b,
$E_c[b]$ is the excitation pattern of channel c at critical band b, and W[b] is the weighting
used in the NER computation described below in the section entitled, "Measuring Audio
Quality." In one implementation, based on experimentation, p = 0.25. Alternatively,
other calculations measuring disparity in the excitation patterns of the input channels
can be used.

At decision 740, the audio encoder compares the measurement of the input
channel excitation pattern disparity to a pre-determined threshold. In one
implementation example, the threshold rule is that the ratio of the expected NER of the
two channels exceeds 2.0, and the smaller expected NER is greater than 0.001. Other
threshold values or rules can be used in alternative implementations of the audio
encoder.

If the disparity measurement does not exceed the threshold, the audio encoder
300 decides to use joint channel coding 500 (Figure 5) for the block as indicated at
action 750. Otherwise, if the disparity measurement exceeds the threshold, the audio
encoder 300 decides against joint channel coding and instead uses independent
channel coding 600 (Figure 6).

The process 700 then continues with the next block of the input signal as 
indicated at decision 770. 
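
The per-block decision of process 700 can be sketched as follows. This is an illustrative sketch under stated assumptions: the sum and difference channels are taken as the half-sum and half-difference of left and right (the document's equations (6) and (7) are not reproduced here), the expected NER of each channel is supplied by the caller, independent coding is assumed when the energy separation test fails, and the name `choose_joint_coding` is hypothetical.

```python
import math
import statistics


def separation(sigma_a, sigma_b):
    """Ratio of the larger to the smaller standard deviation.

    A zero denominator yields infinity, as the text suggests.
    """
    hi, lo = max(sigma_a, sigma_b), min(sigma_a, sigma_b)
    return math.inf if lo == 0 else hi / lo


def choose_joint_coding(left, right, ner_left, ner_right,
                        ratio_threshold=2.0, ner_floor=0.001):
    """Return True for joint channel coding, False for independent coding."""
    # Assumed sum/difference transform (half-sum, half-difference).
    sum_ch = [(l + r) / 2 for l, r in zip(left, right)]
    diff_ch = [(l - r) / 2 for l, r in zip(left, right)]

    sep_without = separation(statistics.pstdev(left), statistics.pstdev(right))
    sep_with = separation(statistics.pstdev(sum_ch), statistics.pstdev(diff_ch))
    if not sep_without < sep_with:
        return False  # assumption: no gain in separation means independent coding

    # Threshold rule from the text: NER ratio above 2.0 and the smaller
    # expected NER above 0.001 means the disparity is too large.
    smaller, larger = sorted((ner_left, ner_right))
    too_disparate = (separation(larger, smaller) > ratio_threshold
                     and smaller > ner_floor)
    return not too_disparate
```

With identical left and right channels the difference channel is silent, so the transform wins the separation test and the outcome rests entirely on the NER threshold rule.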

IV. Band Truncation

In audio encoding, a general rule of thumb can be expressed that "coding lower
frequencies well" produces better sounding reconstructed audio than "coding all
frequencies poorly." The audio encoder 300 (Figure 3) performs a band truncation
process that applies this rule. In this band truncation process, the audio encoder
eliminates a few higher frequency coefficients from the transform coefficients that are
coded into the compressed audio stream. In other words, the audio encoder zeroes
out or otherwise does not code the value of the eliminated transform coefficients. This
permits the surviving transform coefficients to be coded at a higher resolution at a
given coding bit rate. More specifically, the audio encoder 300 suppresses transform
coefficients for frequencies above a cut-off frequency that is a function of the achieved
perceptual audio quality (e.g., the NER value calculated as described below in the
section entitled, "Measuring Audio Quality").

Figure 8 shows a graph 800 of one example of the cut-off frequency of the band
truncation process as a function of the achieved NER value, where the cut-off
frequency decreases (eliminating more transform coefficients from coding) as the NER
value increases. In some audio encoders, the function relating cut-off frequency to
NER value is coding mode dependent. Alternatively, various other functions relating
the cut-off frequency of band truncation to an achieved quality measurement can be
used. In another example, 20% of transform coefficients are truncated if the NER
value is greater than or equal to 0.5 for an 8 kHz audio source and 8 Kbps bit rate of
compressed audio.
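
As a small illustration of the band truncation rule, the sketch below zeroes the coefficients above a cut-off index and uses a step function modeled on the 8 kHz / 8 Kbps example. The function names are hypothetical, and leaving the block untouched when the NER is below 0.5 is an assumption made only for this example; the curve of graph 800 is not reproduced.

```python
def truncate_bands(coeffs, cutoff_index):
    """Zero out every transform coefficient at or above cutoff_index."""
    return [c if k < cutoff_index else 0.0 for k, c in enumerate(coeffs)]


def cutoff_index_for_ner(ner, num_coeffs):
    """Hypothetical cut-off rule: drop the top 20% of coefficients
    once the NER reaches 0.5, otherwise keep every coefficient."""
    if ner >= 0.5:
        return num_coeffs - num_coeffs // 5
    return num_coeffs
```

Calling `truncate_bands(block, cutoff_index_for_ner(0.6, len(block)))` on a ten-coefficient block leaves the first eight coefficients intact and zeroes the last two.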

Figure 9 shows an improved band truncation process 810 in the audio encoder
300 (Figure 3). In the improved band truncation process 810, the audio encoder 300
performs a first-pass band truncation as an open-loop computation based on a target
NER for the audio signal, then performs a second band truncation as a closed-loop
computation based on the achieved NER after compression of the audio signal with the
first-pass band truncation.

The improved band truncation process 810 utilizes a combination of audio
encoder components, including a target NER setting 820, a band truncation component
830, encoding component 840, and quality measurement component 850. The target
NER setting 820 provides the target NER for the audio signal to the band truncation
component 830, which then performs the first-pass band truncation on the input audio
signal using the cut-off frequency yielded from the target NER by the function shown in
the graph 800 of Figure 8. The encoding component 840 performs encoding and
decoding of the first-pass band truncated audio signal as described above with
reference to the generalized encoder 300 (Figure 3) and decoder 400 (Figure 4),
including frequency transform, quantization and inverse transform. The quality
measurement component 850 then calculates the achieved NER for the now
reconstructed audio signal as described below in the section entitled, "Measuring Audio
Quality." The quality measurement component 850 provides feedback of the achieved
NER to the band truncation component 830, which then performs the second-pass
band truncation on the input audio signal using the cut-off frequency yielded from the
achieved NER by the function shown in graph 800. The encoding component then
performs final encoding of the input audio signal with the second-pass band truncation
to produce the compressed audio signal stream 860. The illustrated improved band
truncation process 810 is performed on a block basis on the input audio signal, but
alternatively can be performed on other bases.
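
The open-loop/closed-loop structure of process 810 can be sketched as below. The cut-off function, the encode/decode round trip and the NER measurement are passed in as callables because their actual definitions lie outside this passage; the toy stand-ins (`toy_cutoff`, `toy_round_trip`, `toy_ner`) are purely hypothetical and exist only to make the sketch runnable. Measuring the achieved NER against the untruncated input block is also an assumption.

```python
def two_pass_band_truncation(block, target_ner, cutoff_from_ner,
                             round_trip, measure_ner):
    """Return (second-pass truncated block, achieved NER) for one block."""
    def truncate(coeffs, cutoff):
        return [c if k < cutoff else 0.0 for k, c in enumerate(coeffs)]

    # First pass: open loop, cut-off chosen from the target NER.
    first_pass = truncate(block, cutoff_from_ner(target_ner))
    # Encode and decode the first-pass block, then measure achieved quality.
    reconstructed = round_trip(first_pass)
    achieved_ner = measure_ner(block, reconstructed)
    # Second pass: closed loop, cut-off chosen from the achieved NER.
    second_pass = truncate(block, cutoff_from_ner(achieved_ner))
    return second_pass, achieved_ner


# Hypothetical stand-ins for the encoder components.
def toy_cutoff(ner):
    return 4 if ner < 0.02 else 2          # keep fewer coefficients when quality is poor


def toy_round_trip(coeffs):
    return [float(round(c)) for c in coeffs]  # crude quantizer


def toy_ner(original, reconstructed):
    noise = sum((o - r) ** 2 for o, r in zip(original, reconstructed))
    energy = sum(o * o for o in original)
    return noise / energy if energy else 0.0
```

In a real encoder the block returned by the second pass would then go through final encoding to produce the compressed stream.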

The improved band truncation process 810 provides the benefit of yielding a
more accurate achieved NER quality measure in the audio encoder 300, such as for
use in closed-loop band truncation and multi-channel re-matrixing, among other
purposes.

V. Multi-Channel Rematrixing

Figure 10 shows a multi-channel rematrixing process 900. When compressing
a multi-channel audio signal at very low rates, the distortion (e.g., quantization noise)
introduced in each channel can have a significant impact on the "stereo-image" upon
play-back. The multi-channel re-matrixing process 900 can reduce the impact of audio
compression on the stereo image of a multi-channel audio signal, as well as improve
the joint-channel coding efficiency, by selectively suppressing certain coding channels
in joint channel coding 500 (Figure 5).

In one implementation of the multi-channel re-matrixing process 900, the audio
encoder 300 (Figure 3) includes a channel suppressor component 910 following the
multi-channel transformation 510. The audio encoder 300 calculates suppression
parameters 920 for the multi-channel re-matrixing process 900. Based on the
suppression parameters, the channel suppressor component 910 selectively
suppresses certain of the coding channels. Upon later application of an inverse
multi-channel transformation 930 (e.g., in the audio decoder 400 of Figure 4 for playback),
this multi-channel re-matrixing process 900 produces re-matrixed multi-channel audio
data with reduced impact of the distortion from compression on the stereo-image.

In one embodiment, the suppression parameters 920 include a scaling factor
(p) whose value is based on: (a) current average levels of a perceptual audio quality
measure (e.g., the NER described in more detail below in the section entitled,
"Measuring Audio Quality"), (b) current rate control buffer fullness, (c) the coding mode
(e.g., the bit rate and sample rate settings, etc. of the audio encoder), and (d) the
amount of channel separation in the source. More specifically, if the current average
level of quality indicates poor reproduction, the value of the scaling factor (p) is made
much smaller than unity so as to produce severe re-matrixing of the multi-channel
audio signal. A similar measure is taken if the rate control buffer is close to being full.
On the other hand, if the two channels in the input data are significantly different, the
scaling factor (p) is made closer to unity, so that little or no re-matrixing takes place.

In the case of a two-channel stereo audio signal, for example, the audio encoder
300 (Figure 3) produces the sum and difference coding channels using the equations
(6) and (7) with the multi-channel transformation 510 as described above. The coding
channel suppression 910 can be described as scaling the difference channel by the
scaling factor (p) in the following equation:

$$\hat{x}_d[n] = p \cdot x_d[n] \qquad (11)$$
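
The effect of scaling the difference channel can be illustrated with a short sketch. It assumes, for illustration only, that the sum and difference channels are the half-sum and half-difference of left and right, so that the inverse transformation is a plain sum and difference; the function name `rematrix` is hypothetical.

```python
def rematrix(left, right, p):
    """Scale the difference channel by p and rebuild left/right channels."""
    out_left, out_right = [], []
    for l, r in zip(left, right):
        s = (l + r) / 2          # assumed sum channel
        d = p * (l - r) / 2      # scaled difference channel
        out_left.append(s + d)
        out_right.append(s - d)
    return out_left, out_right
```

With p = 1.0 the channels come back unchanged, while p = 0.0 collapses both outputs onto the sum channel, which is the most severe re-matrixing. Intermediate values narrow the stereo image proportionally.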

The scaling factor (p) in this illustrated embodiment for two-channel stereo
audio is calculated as follows. If the sample rate is greater than 32 kHz and the bit
rate is greater than 32 Kbps, then the scaling factor (p) is set equal to 1.0. For other
combinations of sample and bit rates, the audio encoder 300 first calculates the energy
separation of the channels. The energy separation of left and right stereo channels is
computed as:



whose value is taken as a large quantity (> 100) if the denominator is zero. 

The audio encoder 300 then determines the scaling factor from the following
tables (13-15), dependent on the perceptual quality measure (NER) and coefficient
index (B) which are described in more detail below in the section entitled, "Measuring
Audio Quality." If (sep < 5), the scaling factor (p) is given as follows:

(13)

If (5 < sep < 100), the scaling factor (p) is given as follows:

p = …   if (NER > 2.5) OR (B_F > 0.95)
    …   if (NER > 2.25) OR (B_F > 0.9)
    …   if (NER > 2.0) OR (B_F > 0.9)
    …   if (NER > 1.75) OR (B_F > 0.9)
    …   if (NER > 1.5) OR (B_F > 0.85)
    …   if (NER > 1.25) OR (B_F > 0.85)
    …   if (NER > 1.0) OR (B_F > 0.85)
    …   if (NER > 0.75) OR (B_F > 0.8)
    …   if (NER > 0.5) OR (B_F > 0.75)
    …   if (NER > 0.25)
    …   otherwise                              (14)

If (100 < sep), the scaling factor (p) is given as follows:

p = …   if (NER > 2.5) OR (B_F > 0.95)
    …   if (NER > 2.25) OR (B_F > 0.9)
    …   if (NER > 2.0) OR (B_F > 0.9)
    …   if (NER > 1.75) OR (B_F > 0.9)
    …   if (NER > 1.5) OR (B_F > 0.85)
    …   if (NER > 1.25) OR (B_F > 0.85)
    …   if (NER > 1.0) OR (B_F > 0.85)
    …   if (NER > 0.75) OR (B_F > 0.8)
    …   if (NER > 0.5) OR (B_F > 0.75)
    …   otherwise                              (15)

Finally, the re-matrixed channels can then be obtained (e.g., in the inverse
multi-channel transformation 930) through the following equations:

$$\hat{x}_l[n] = x_s[n] + \hat{x}_d[n] \qquad (16)$$

$$\hat{x}_r[n] = x_s[n] - \hat{x}_d[n] \qquad (17)$$

VI. Quantizer Step-Size Modification For Header Reduction

Figure 11 shows a header reduction process 1100 to further improve coding
efficiency in the audio encoder 300 (Figure 3). In the audio encoder 300, a
quantization matrix containing quantizer step size information for each quantization
band of each coding channel is normally sent for every frame of coded data in the
compressed audio data stream. These quantization matrices are differentially encoded
(e.g., similar to differential pulse code modulation) in a header of each frame within the
compressed audio stream produced by the audio encoder. The quantization matrix is
described in further detail in the related patent application, entitled "Quantization
Matrices For Digital Audio," which is incorporated herein by reference above.

Generally, at lower coding rates, the audio encoder 300 quantizes the coefficients
of certain quantization bands to all zeroes, either because the coefficient values are
small enough to quantize to zero or because of the band truncation process described
above. In such a case, the quantization step size for the zeroed quantization band is
not needed by the decoder to decode the compressed audio signal stream.

The header reduction process 1100 reduces the size of the header by
selectively modifying the quantization step size of quantization band coefficients that
are quantized to zero, so that such quantization step sizes will differentially encode
using fewer bits in the header. More specifically, at action 1110 in the header reduction
process 1100, the audio encoder 300 identifies which quantization bands are quantized
to zero, either due to band truncation or because the value of the coefficient for that
band is sufficiently small to quantize to zero. At action 1120, the audio encoder 300
modifies the quantization step size of the identified quantization bands to values that
will be encoded in fewer bits in the header.

Figure 12 shows a graph 1200 of an example of quantization step-size
modification for header reduction via the header reduction process 1100. The values
of the original quantization step sizes of the quantization bands for this frame of the
audio signal are shown by the line labeled "quant. step before bit reduction" in graph
1200. In this example, quantization bands numbered 2 through 20 are quantized to
zero (as indicated by the "band required" line of the graph 1200). The header
reduction process 1100 therefore modifies the quantization step sizes for these bands
to values (e.g., the value of quantization band numbered 21 in this example) that will
be differentially encoded in the header using fewer bits. The modified values are
depicted in the graph 1200 by the line labeled "quant. step after bit reduction." The
particular modification of the quantization step sizes that will yield fewer bits in the
header is dependent on the particular form of encoding used. Accordingly, the header
reduction process 1100 modifies the value of the quantization step sizes of the zeroed
quantization band coefficients to a value that will encode in fewer bits for the particular
form of quantization step encoding employed by the audio encoder (whether differential
encoding or otherwise).
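
One way to picture the step-size modification for a differential encoding is the sketch below. It copies onto each zeroed band the step size of the next band that is actually needed, as in the Figure 12 example where bands 2 through 20 take the value of band 21. The fallback for trailing zeroed bands and the function name `modify_step_sizes` are assumptions; the best replacement value depends on the actual header encoding.

```python
def modify_step_sizes(step_sizes, zeroed):
    """Replace step sizes of zeroed bands so that differences shrink.

    step_sizes: quantizer step size per quantization band.
    zeroed:     True where every coefficient of the band quantizes to zero.
    """
    out = list(step_sizes)
    fill = None
    # Walk backwards so each zeroed band copies the next needed band.
    for i in reversed(range(len(out))):
        if zeroed[i]:
            if fill is not None:
                out[i] = fill
        else:
            fill = out[i]
    # Assumption: trailing zeroed bands repeat the previous band's value.
    for i in range(1, len(out)):
        if all(zeroed[i:]):
            out[i] = out[i - 1]
    return out


def differential_cost(step_sizes):
    """Crude proxy for header bits: sum of absolute band-to-band differences."""
    return sum(abs(b - a) for a, b in zip(step_sizes, step_sizes[1:]))
```

The cost function is only a stand-in for the real entropy code, but it shows why runs of identical step sizes are cheaper to send differentially.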

VII. Measuring Audio Quality

Figure 13 shows an example of a mapping (1300) between quantization bands
and critical bands. The critical bands are determined by an auditory model, while the
quantization bands are determined by the encoder for efficient representation of the
quantization matrix. The number of quantization bands can be different (typically less)
than the number of critical bands, and the band boundaries can be different as well. In
one implementation, the number of quantization bands relates to block size. For a
block of 2048 frequency coefficients, the number of quantization bands is 25, and each
quantization band maps to one of 25 critical bands of the same frequency range. For a
block of 64 frequency coefficients, the number of quantization bands is 13, and
some quantization bands map to multiple critical bands.

Figures 14a-14d show techniques for computing one particular type of quality
measure - Noise to Excitation Ratio ["NER"]. Figure 14a shows a technique (1400)
for computing NER of a block by critical bands for a single channel. The overall
quality measure for the block is a weighted sum of the NERs of individual critical bands.
Figures 14b and 14c show additional detail for several stages of the technique (1400).
Figure 14d shows a technique (1401) for computing NER of a block by quantization
bands.

The inputs to the techniques (1400) and (1401) include the original frequency
coefficients X[k] for the block, the reconstructed coefficients $\hat{X}[k]$ (inverse
quantized, inverse weighted, and inverse multi-channel transformed if needed), and
one or more weight arrays. The one or more weight arrays can indicate 1) the relative
importance of different bands to perception, 2) whether bands are truncated, and/or 3)
whether bands are noise-substituted. The one or more weight arrays can be in
separate arrays (e.g., W[b], Z[b], G[b]), in a single aggregate array, or in some other
combination. Figures 14b and 14c show other inputs such as transform block size (i.e.,
current window/sub-frame size), maximum block size (i.e., largest time window/frame
size), sampling rate, and the number and positions of critical bands.

A. Computing Excitation Patterns 

With reference to Figure 14a, the encoder computes (1410) the excitation
pattern E[b] for the original frequency coefficients X[k] and computes (1430) the
excitation pattern $\hat{E}[b]$ for the reconstructed frequency coefficients $\hat{X}[k]$ for a block of
audio information. The encoder computes the excitation pattern E[b] with the same
coefficients that are used in compression, using the sampling rate and block sizes used
in compression, which makes the process more flexible than the process for computing
excitation patterns described in ITU-R BS 1387. In addition, several steps from ITU-R
BS 1387 are eliminated (e.g., the adding of internal noise) or simplified to reduce
complexity with only a little loss of accuracy.

Figure 14b shows in greater detail the stage of computing (1410) the excitation
pattern E[b] for the original frequency coefficients X[k] in a variable-size transform
block. To compute (1430) $\hat{E}[b]$, the input is $\hat{X}[k]$ instead of X[k], and the process is
analogous.

First, the encoder normalizes (1412) the block of frequency coefficients
X[k], $0 \le k < (subframe\_size/2)$, for a sub-frame, taking as inputs the current
sub-frame size and the maximum sub-frame size (if not pre-determined in the encoder).
The encoder normalizes the size of the block to a standard size by interpolating values
between frequency coefficients up to the largest time window/sub-frame size. For
example, the encoder uses a zero-order hold technique (i.e., coefficient repetition):

$$Y[k] = \alpha X[k'] \qquad (18),$$

$$k' = \mathrm{floor}\left(\frac{k}{\rho}\right) \qquad (19),$$

$$\rho = \frac{max\_subframe\_size}{subframe\_size} \qquad (20),$$

where Y[k] is the normalized block with interpolated frequency coefficient values, $\alpha$ is
an amplitude scaling factor described below, and k' is an index in the block of
frequency coefficients. The index k' depends on the interpolation factor $\rho$, which is
the ratio of the largest sub-frame size to the current sub-frame size. If the current
sub-frame size is 1024 coefficients and the maximum size is 4096 coefficients, $\rho$ is 4, and
for every coefficient from 0-511 in the current transform block (which has a size of
$0 \le k < (subframe\_size/2)$), the normalized block Y[k] includes four consecutive
values. Alternatively, the encoder uses other linear or non-linear interpolation
techniques to normalize block size.
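
The zero-order hold normalization can be sketched in a few lines. This is an illustration only: the function name `normalize_block` is hypothetical, the amplitude scaling factor is passed in as a plain parameter rather than derived, and the maximum sub-frame size is assumed to be an integer multiple of the current sub-frame size.

```python
def normalize_block(coeffs, max_subframe_size, alpha=1.0):
    """Stretch a block of subframe_size/2 coefficients to max_subframe_size/2
    values by repeating each coefficient (zero-order hold)."""
    subframe_size = 2 * len(coeffs)            # the block holds subframe_size/2 values
    rho = max_subframe_size // subframe_size   # interpolation factor
    return [alpha * coeffs[k // rho] for k in range(max_subframe_size // 2)]
```

For a 1024-sample sub-frame (512 coefficients) and a 4096-sample maximum, the interpolation factor is 4 and the normalized block holds 2048 values, each input coefficient appearing four times in a row.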

The scaling factor $\alpha$ compensates for changes in amplitude scale that relate to
sub-frame size. In one implementation, the scaling factor is:

$$\alpha = \frac{c}{subframe\_size} \qquad (21),$$

where c is a constant with a value determined experimentally, for example, c = 1.0.
Alternatively, other scaling factors can be used to normalize block amplitude scale.

Figure 15 shows a technique (1500) for measuring the audio quality of
normalized, variable-size blocks in a broader context than Figures 14a through 14d. A
tool such as an audio encoder gets (1510) a first variable-size block and normalizes
(1520) the variable-size block. The variable-size block is, for example, a variable-size
transform block of frequency coefficients. The normalization can include block size
normalization as well as amplitude scale normalization, and enables comparisons and
operations between different variable-size blocks.

Next, the tool computes (1530) a quality measure for the normalized block. For
example, the tool computes NER for the block.

If the tool determines (1540) that there are no more blocks to measure quality
for, the technique ends. Otherwise, the tool gets (1550) the next block and repeats the
process. For the sake of simplicity, Figure 15 does not show repeated computation of
the quality measure (as in a quantization loop) or other ways in which the technique
(1500) can be used in conjunction with other techniques.

Returning to Figure 14b, after normalizing (1412) the block, the encoder
optionally applies (1414) an outer/middle ear transfer function to the normalized block.

$$Y[k] \leftarrow A[k] \cdot Y[k] \qquad (22).$$

Modeling the effects of the outer and middle ear on perception, the function
A[k] generally preserves coefficients at lower and middle frequencies and attenuates
coefficients at higher frequencies. Figure 16 shows an example of a transfer function
(1600) used in one implementation. Alternatively, a transfer function of another shape
is used. The application of the transfer function is optional. In particular, for high
bitrate applications, the encoder preserves fidelity at higher frequencies by not applying
the transfer function.

The encoder next computes (1416) the band energies for the block, taking as
inputs the normalized block of frequency coefficients Y[k], the number and positions
of the bands, the maximum sub-frame size, and the sampling rate. (Alternatively, one
or more of the band inputs, size, or sampling rate is predetermined.) Using the
normalized block Y[k], the energy within each critical band b is accumulated:

$$E[b] = \sum_{k \in B[b]} Y^2[k] \qquad (23),$$

where B[b] is a set of coefficient indices that represent frequencies within critical band
b. For example, if the critical band b spans the frequency range $[f_l, f_h)$, the set
B[b] can be given as:

$$B[b] = \left\{ k \;\middle|\; \frac{k \cdot samplingrate}{max\_subframe\_size} \ge f_l \ \mathrm{AND}\ \frac{k \cdot samplingrate}{max\_subframe\_size} < f_h \right\} \qquad (24).$$

So, if the sampling rate is 44.1 kHz and the maximum sub-frame size is 4096
samples, the coefficient indices 38 through 47 (of 0 to 2047) fall within a critical band
that runs from 400 Hz up to but not including 510 Hz. The frequency ranges for the
critical bands are implementation-dependent, and numerous options are well known.
For example, see ITU-R BS 1387, the MP3 standard, or references mentioned therein.

Next, also in optional stages, the encoder smears the energies of the critical
bands in frequency smearing (1418) between critical bands in the block and temporal
smearing (1420) from block to block. The normalization of block sizes facilitates and
simplifies temporal smearing between variable-size transform blocks. The frequency
smearing (1418) and temporal smearing (1420) are also implementation-dependent,
and numerous options are well known. For example, see ITU-R BS 1387, the MP3
standard, or references mentioned therein. The encoder outputs the excitation pattern
E[b] for the block.

Alternatively, the encoder uses another technique to measure the excitation of
the critical bands of the block.
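
The band-energy stage of this section can be sketched as follows, covering only the index-set construction and energy accumulation (the optional ear transfer function and smearing stages are left out). The function names are hypothetical, and bands are represented as half-open frequency ranges in Hz.

```python
def band_indices(f_low, f_high, sampling_rate, max_subframe_size):
    """Coefficient indices whose frequency lies in [f_low, f_high)."""
    num_coeffs = max_subframe_size // 2
    return [k for k in range(num_coeffs)
            if f_low <= k * sampling_rate / max_subframe_size < f_high]


def band_energies(normalized_block, bands, sampling_rate, max_subframe_size):
    """Sum of squared normalized coefficients within each critical band."""
    return [
        sum(normalized_block[k] ** 2
            for k in band_indices(lo, hi, sampling_rate, max_subframe_size))
        for lo, hi in bands
    ]
```

Evaluating `band_indices(400, 510, 44100, 4096)` reproduces the example in the text: the band from 400 Hz up to 510 Hz covers coefficient indices 38 through 47.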



B. Computing Effective Excitation Pattern 

Returning to Figure 14a, from the excitation patterns E[b] and $\hat{E}[b]$ for the
original and the reconstructed frequency coefficients, respectively, the encoder
computes (1450) an effective excitation pattern $\tilde{E}[b]$. For example, the encoder finds
the minimum excitation on a band by band basis between E[b] and $\hat{E}[b]$:

$$\tilde{E}[b] = \mathrm{Min}(E[b], \hat{E}[b]) \qquad (25).$$

Alternatively, the encoder uses another formula to determine the effective
excitation pattern. Excitation in the reconstructed signal can be more than or less than
the excitation in the original signal due to the effects of quantization. Using the effective
excitation pattern $\tilde{E}[b]$ rather than the excitation pattern E[b] for the original signal
ensures that the masking component is present at reconstruction. For example, if the
original frequency coefficients in a band are heavily quantized, the masking component
that is supposed to be in that band might not be present in the reconstructed signal,
making noise audible rather than inaudible. On the other hand, if the excitation at a
band in the reconstructed signal is much greater than the excitation at that band in the
original signal, the excess excitation in the reconstructed signal may itself be due to
noise, and should not be factored into later NER calculations.
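
The band-by-band minimum is simple enough to show directly. The function name `effective_excitation` is hypothetical, and both patterns are assumed to be lists indexed by critical band.

```python
def effective_excitation(e_original, e_reconstructed):
    """Per-band minimum of the original and reconstructed excitation patterns."""
    return [min(orig, recon) for orig, recon in zip(e_original, e_reconstructed)]
```

In the first band of `effective_excitation([4.0, 1.0, 3.0], [2.0, 5.0, 3.0])` the reconstructed excitation is lower (masking energy was lost to quantization), and in the second it is higher (the excess may be noise); in both cases the smaller value is kept.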

Figure 17 shows a technique (1700) for computing an effective masking
measure in a broader context than Figures 14a through 14d. A tool such as an audio
encoder computes (1710) an original audio masking measure. For example, the tool
computes an excitation pattern for a block of original frequency coefficients.
Alternatively, the tool computes another type of masking measure (e.g., masking
threshold), measures something other than blocks (e.g., channels, entire signals),
and/or measures another type of information.

The tool computes (1720) a reconstructed audio masking measure of the same
general format as the original audio masking measure.

Next, the tool computes (1730) an effective masking measure based at least in
part upon the original audio masking measure and the reconstructed audio masking
measure. For example, the tool finds the minimum of two excitation patterns.
Alternatively, the tool uses another technique to determine the effective
masking measure. For the sake of simplicity, Figure 17 does not show repeated
computation of the effective masking measure (as in a quantization loop) or other ways
in which the technique (1700) can be used in conjunction with other techniques.


C. Computing Noise Pattern 

Returning to Figure 14a, the encoder computes (1470) the noise pattern F[b]
from the difference between the original frequency coefficients and the reconstructed
frequency coefficients. Alternatively, the encoder computes the noise pattern F[b]
from the difference between time series of original and reconstructed audio samples.
The computing of the noise pattern F[b] uses some of the steps used in computing
excitation patterns. Figure 14c shows in greater detail the stage of computing (1470)
the noise pattern F[b].
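
As a rough sketch of the noise pattern stage, the code below takes the coefficient differences, stretches them to the maximum block size by coefficient repetition, and accumulates the squared differences per critical band. It is an illustration under stated assumptions: the optional outer/middle ear transfer function is skipped, the amplitude scaling factor is a plain parameter, each critical band is given as a ready-made list of coefficient indices, and the name `noise_pattern` is hypothetical.

```python
def noise_pattern(original, reconstructed, band_index_sets,
                  max_subframe_size, alpha=1.0):
    """Per-band energy of the difference between original and reconstructed
    frequency coefficients, computed on a size-normalized block."""
    subframe_size = 2 * len(original)
    rho = max_subframe_size // subframe_size   # interpolation factor
    diffs = [o - r for o, r in zip(original, reconstructed)]
    # Zero-order hold normalization of the differences.
    dy = [alpha * diffs[k // rho] for k in range(max_subframe_size // 2)]
    # No smearing: the noise is a masked signal, not a masker.
    return [sum(dy[k] ** 2 for k in indices) for indices in band_index_sets]
```

Unlike the excitation pattern, nothing is smeared across bands or blocks here, so each band's value depends only on the differences that fall inside it.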

First, the encoder computes (1472) the differences between a block of original
frequency coefficients X[k] and a block of reconstructed frequency coefficients $\hat{X}[k]$
for $0 \le k < (subframe\_size/2)$. The encoder normalizes (1474) the block of
differences, taking as inputs the current sub-frame size and the maximum sub-frame
size (if not pre-determined in the encoder). The encoder normalizes the size of the
block to a standard size by interpolating values between frequency coefficients up to
the largest time window/sub-frame size. For example, the encoder uses a zero-order
hold technique (i.e., coefficient repetition):

$$DY[k] = \alpha \left( X[k'] - \hat{X}[k'] \right) \qquad (26),$$

where DY[k] is the normalized block of interpolated frequency coefficient differences,
$\alpha$ is an amplitude scaling factor described in Equation (21), and k' is an index in the
sub-frame block described in Equation (19). Alternatively, the encoder uses other
techniques to normalize the block.

After normalizing (1474) the block, the encoder optionally applies (1476) an
outer/middle ear transfer function to the normalized block.

$$DY[k] \leftarrow A[k] \cdot DY[k] \qquad (27),$$

where A[k] is a transfer function as shown, for example, in Figure 16.

The encoder next computes (1478) the band energies for the block, taking as
inputs the normalized block of frequency coefficient differences DY[k], the number
and positions of the bands, the maximum sub-frame size, and the sampling rate.
(Alternatively, one or more of the band inputs, size, or sampling rate is predetermined.)
Using the normalized block of frequency coefficient differences DY[k], the energy
within each critical band b is accumulated:

$$F[b] = \sum_{k \in B[b]} DY^2[k] \qquad (28),$$

where B[b] is a set of coefficient indices that represent frequencies within critical band
b as described in Equation (24). As the noise pattern F[b] represents a masked signal
rather than a masking signal, the encoder does not smear the noise patterns of critical
bands for simultaneous or temporal masking.

Alternatively, the encoder uses another technique to measure noise in the
critical bands of the block.

D. Band Weights

Before computing NER for a block, the encoder determines one or more sets
of band weights for NER of the block. For the bands of the block, the band weights
indicate perceptual weightings, which bands are noise-substituted, which bands are
truncated, and/or other weighting factors. The different sets of band weights can be
represented in separate arrays (e.g., W[b], G[b], and Z[b]), assimilated into a single
array of weights, or combined in other ways. The band weights can vary from block to
block in terms of weight amplitudes and/or numbers of band weights.

Figure 18 shows a technique (1800) for computing a band-weighted quality
measure for a block in a broader context than Figures 14a through 14d. A tool such as
an audio encoder gets (1810) a first block of spectral information and determines
(1820) band weights for the block. For example, the tool computes a set of perceptual
weights, a set of weights indicating which bands are noise-substituted, a set of weights
indicating which bands are truncated, and/or another set of weights for another
weighting factor. Alternatively, the tool receives the band weights from another
module. Within an encoding session, the band weights for one block can be different
than the band weights for another block in terms of the weights themselves or the
number of bands.

The tool then computes (1830) a band-weighted quality measure. For example,
the tool computes a band-weighted NER. The tool determines (1840) if there are
more blocks. If so, the tool gets (1850) the next block and determines (1820) band
weights for the next block. For the sake of simplicity, Figure 18 does not show different
ways to combine sets of band weights, repeated computation of the quality measure
for the block (as in a quantization loop), or other ways in which the technique (1800)
can be used in conjunction with other techniques.

1. Perceptual Weights

With reference to Figure 14a, a perceptual weight array W[b] accounts for the
relative importance of different bands to the perceived quality of the reconstructed
audio. In general, bands for middle frequencies are more important to perceived
quality than bands for low or high frequencies. Figure 19 shows an example of a set of
perceptual weights (1900) for critical bands for NER computation. The middle critical
bands are given higher weights than the lower and higher critical bands. The
perceptual weight array W[b] can vary in terms of amplitudes from block to block
within an encoding session; the weights can be different for different patterns of audio
information (e.g., different excitation patterns), different applications (e.g., speech
coding, music coding), different sampling rates (e.g., 8 kHz, 96 kHz), different bitrates
of coding, or different levels of audibility of target listeners (e.g., playback at 40 dB, 96
dB). The perceptual weight array W[b] can also change in response to user input
(e.g., a user adjusting weights based on the user's preferences).

SAW: 12/14/01 3382-61344 180529.1 Express Mail No. EL 874429730 US

2. Noise Substitution 

In one implementation, the encoder can use noise substitution (rather than
quantization of spectral information) to parametrically convey audio information for a
band in low and mid-bitrate coding. The encoder considers the audio pattern (e.g.,
harmonic, tonal) in deciding whether noise substitution is more efficient than sending
quantized spectral information. Typically, the encoder starts using noise substitution
for higher bands and does not use noise substitution at all for certain bands. When the
generated noise pattern for a band is combined with other audio information to
reconstruct audio samples, the audibility of the noise is comparable to the audibility of
the noise associated with an actual noise pattern.

Generated noise patterns may not integrate well with quality measurement
techniques designed for use with actual noise and signal patterns, however. Using a
generated noise pattern for a completely or partially noise-substituted band, NER or
another quality measure may inaccurately estimate the audibility of noise at that band.

For this reason, the encoder of Figure 14a does not factor the generated noise
patterns of the noise-substituted bands into the NER. The array G[b] indicates which
critical bands are noise-substituted in the block with a weight of 1 for each noise-substituted
band and a weight of 0 for each other band. The encoder uses the array
G[b] to skip noise-substituted bands when computing NER. Alternatively, the array
G[b] includes a weight of 0 for noise-substituted bands and 1 for all other bands, and
the encoder multiplies the NER by the weight 0 for noise-substituted bands; or, the
encoder uses another technique to account for noise substitution in quality
measurement.

An encoder typically uses noise substitution with respect to quantization bands.
The encoder of Figure 14a measures quality for critical bands, however, so the
encoder maps noise-substituted quantization bands to critical bands. For example,
suppose the spectrum of noise-substituted quantization band d overlaps (partially or
completely) the spectrum of critical bands b_lowd through b_highd. The entries G[b_lowd]
through G[b_highd] are set to indicate noise-substituted bands. Alternatively, the
encoder uses another linear or non-linear technique to map noise-substituted
quantization bands to critical bands.
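
The overlap-based mapping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `map_flags_to_critical_bands`, the representation of a band as a `(low_hz, high_hz)` pair, and the band edges in the usage note are hypothetical and do not come from the encoder of Figure 14a. The sketch uses the convention in which G[b] is 1 for a noise-substituted band and 0 otherwise.

```python
from typing import List, Sequence, Tuple

# Hypothetical representation: a band is a half-open frequency range in Hz.
Band = Tuple[float, float]


def map_flags_to_critical_bands(
    quant_bands: Sequence[Band],
    quant_flags: Sequence[bool],
    critical_bands: Sequence[Band],
) -> List[int]:
    """Build G[b] for critical bands from per-quantization-band flags.

    G[b] is set to 1 when critical band b overlaps, partially or
    completely, any flagged (noise-substituted) quantization band.
    """
    g = [0] * len(critical_bands)
    for (q_lo, q_hi), flagged in zip(quant_bands, quant_flags):
        if not flagged:
            continue
        for b, (c_lo, c_hi) in enumerate(critical_bands):
            # Two half-open ranges overlap when each starts before the other ends.
            if q_lo < c_hi and c_lo < q_hi:
                g[b] = 1
    return g
```

For example, with illustrative critical bands `[(0, 100), (100, 200), (200, 400), (400, 800)]` and a single flagged quantization band `(150, 500)`, the second through fourth critical bands are marked. The same function could plausibly be reused to build the array Z[b] for truncated bands.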

For multi-channel audio, the encoder computes NER for each channel
separately. If the multi-channel audio is in independently coded channels, the encoder
can use a different array G[b] for each channel. On the other hand, if the multi-channel
audio is in jointly coded channels, the encoder uses an identical array G[b] for
all reconstructed channels that are jointly coded. If any of the jointly coded channels
has a noise-substituted band, when the jointly coded channels are transformed into
independently coded channels, each independently coded channel will have noise from
the generated noise pattern for that band. Accordingly, the encoder uses the same
array G[b] for all reconstructed channels, and the encoder includes fewer arrays G[b]
in the output bitstream, lowering overall bitrate.
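
One way to picture the choice between per-channel and shared arrays is the following Python sketch. It assumes, as an interpretation of the paragraph above rather than a statement of the actual encoder logic, that the shared array marks a band as noise-substituted when any jointly coded channel marks it. The function name `band_flags_for_channels` and its parameters are hypothetical.

```python
from typing import List, Sequence


def band_flags_for_channels(
    per_channel_flags: Sequence[Sequence[int]],
    jointly_coded: bool,
) -> List[List[int]]:
    """Return one G[b] array per reconstructed channel.

    Independently coded channels keep their own arrays. Jointly coded
    channels all receive the same array, here formed (as an assumption)
    by marking a band if any channel has it noise-substituted.
    """
    if not jointly_coded:
        return [list(flags) for flags in per_channel_flags]
    # Combine band by band across channels: 1 if any channel flags the band.
    shared = [int(any(column)) for column in zip(*per_channel_flags)]
    return [list(shared) for _ in per_channel_flags]
```

With two channels flagged as `[1, 0, 0]` and `[0, 0, 1]`, the jointly coded case yields `[1, 0, 1]` for both channels, so only one array would need to be signaled.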

More generally, Figure 20 shows a technique (2000) for measuring audio
quality in a channel mode-dependent manner. A tool such as an audio encoder
optionally applies (2010) a multi-channel transform to multi-channel audio. For
example, a tool that works with stereo mode audio optionally outputs the stereo audio
in independently coded channels or in jointly coded channels.

The tool determines (2020) the channel mode of the multi-channel audio and
then measures quality in a channel mode-dependent manner. If the audio is in
independently coded channels, the tool measures (2030) quality using a technique for
independently coded channels, and if the audio is in jointly coded channels, the tool
measures (2040) quality using a technique for jointly coded channels. For example,
the tool uses a different band weighting technique depending on the channel mode.
Alternatively, the tool uses a different technique for measuring noise, excitation,
masking capacity, or other pattern in the audio depending on the channel mode.

While Figure 20 shows two modes, other numbers of modes are possible. For
the sake of simplicity, Figure 20 does not show repeated computation of the quality
measure for the block (as in a quantization loop), or other ways in which the technique
(2000) can be used in conjunction with other techniques.

3. Band Truncation

In one implementation, the encoder can truncate higher bands to improve audio 
quality for the remaining bands. The encoder can adaptively change the threshold 
above which bands are truncated, truncating more or fewer bands depending on 
current quality measurements. 

When the encoder truncates a band, the encoder does not factor the quality
measurement for the truncated band into the NER. With reference to Figure 14a, the
array Z[b] indicates which bands are truncated in the block with a weighting pattern
such as one described above for the array G[b]. When the encoder measures quality
for critical bands, the encoder maps truncated quantization bands to critical bands
using a mapping technique such as one described above for the array G[b]. When the
encoder measures quality of multi-channel audio in jointly coded channels, the encoder
can use the same array Z[b] for all reconstructed channels.



E. Computing Noise to Excitation Ratio 

With reference to Figure 14a, the encoder next computes (790) band-weighted
NER for the block. For the critical bands of the block, the encoder computes the ratio
of the noise pattern F[b] to the effective excitation pattern E[b]. The encoder weights
the ratio with band weights to determine the band-weighted NER for a block of a
channel c:

$$NER[c] = \sum_{\text{all } b} W[b] \, \frac{F[b]}{E[b]} \qquad (29).$$

Another equation for NER[c] if the weights W[b] are not normalized is:

$$NER[c] = \frac{\sum_{\text{all } b} W[b] \, \frac{F[b]}{E[b]}}{\sum_{\text{all } b} W[b]} \qquad (30).$$

Instead of a single set of band weights representing one kind of weighting factor
or an aggregation of all weighting factors, the encoder can work with multiple sets of
band weights. For example, Figure 14a shows three sets of band weights W[b], G[b],
and Z[b], and the equation for NER[c] is:

$$NER[c] = \frac{\sum_{\text{all } b} W[b] \, G[b] \, Z[b] \, \frac{F[b]}{E[b]}}{\sum_{\text{all } b} W[b] \, G[b] \, Z[b]} \qquad (31).$$


For other formats of the sets of band weights, the equation for band-weighted 
NER[c] varies accordingly. 

For multi-channel audio, the encoder can compute an overall NER from 
NER[c] of each of the multiple channels. In one implementation, the encoder 
computes overall NER as the maximum distortion over all channels:

$$NER_{overall} = \max_{\text{all } c} \left( NER[c] \right) \qquad (32).$$

Alternatively, the encoder uses another non-linear or linear function to compute 
overall NER from NER[c] of multiple channels. 
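
As a minimal sketch of the band-weighted computation described in this section, the Python below computes NER for one channel and then an overall NER as the maximum over channels. It is an illustration under assumptions, not the encoder's implementation: the function names are hypothetical, the result is normalized by the sum of the weights, every excitation value is assumed to be positive, and the optional `include` array is assumed to use the format in which a weight of 0 marks a band to skip (for example, a noise-substituted or truncated band) and 1 marks a band to count.

```python
from typing import Optional, Sequence


def band_weighted_ner(
    noise: Sequence[float],
    excitation: Sequence[float],
    weights: Sequence[float],
    include: Optional[Sequence[int]] = None,
) -> float:
    """Band-weighted noise to excitation ratio for one channel.

    noise[b] is F[b], excitation[b] is the effective excitation E[b]
    (assumed positive), and weights[b] is W[b]. include[b] is a
    hypothetical 0/1 array combining any skip flags; 0 means skip.
    """
    if include is None:
        include = [1] * len(weights)
    numerator = 0.0
    denominator = 0.0
    for f, e, w, keep in zip(noise, excitation, weights, include):
        effective_weight = w * keep
        numerator += effective_weight * f / e
        denominator += effective_weight
    # If every band is skipped there is nothing to measure; report 0.
    return numerator / denominator if denominator > 0 else 0.0


def overall_ner(per_channel_ner: Sequence[float]) -> float:
    """Overall NER taken as the maximum distortion over all channels."""
    return max(per_channel_ner)
```

Dividing by the sum of the effective weights keeps the measure comparable between blocks in which different numbers of bands are skipped. Taking the maximum means the worst channel determines the overall figure.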

F. Computing Noise to Excitation Ratio with Quantization Bands 

Instead of measuring audio quality of a block by critical bands, the encoder can 
measure audio quality of a block by quantization bands, as shown in Figure 14d. 

The encoder computes (1410, 1430) the excitation patterns E[b] and E[b] , 
computes (1450) the effective excitation pattern E[b], and computes (1470) the noise 
pattern F[b] as in Figure 14a. 

At some point before computing (791) the band-weighted NER, however, the
encoder converts all patterns for critical bands into patterns for quantization bands.
For example, the encoder converts (780) the effective excitation pattern E[b] for
critical bands into an effective excitation pattern E[d] for quantization bands.
Alternatively, the encoder converts from critical bands to quantization bands at some
other point, for example, after computing the excitation patterns. In one
implementation, the encoder creates E[d] by weighting E[b] according to proportion
of spectral overlap (i.e., overlap of frequency ranges) of the critical bands and the
quantization bands. Alternatively, the encoder uses another linear or non-linear
weighting technique for the band conversion.

The encoder also converts (785) the noise pattern F[b] for critical bands into a
noise pattern F[d] for quantization bands using a band weighting technique such as
one described above for E[d].
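
The conversion by proportion of spectral overlap can be sketched as follows. This Python example shows one possible reading only: it assumes each critical-band value is spread uniformly across that band's frequency range, so a quantization band receives the fraction of the value that matches the fraction of the critical band it covers. The function name `convert_to_quant_bands`, the `(low_hz, high_hz)` band representation, and the band edges in the usage note are hypothetical.

```python
from typing import List, Sequence, Tuple

# Hypothetical representation: a band is a half-open frequency range in Hz.
Band = Tuple[float, float]


def convert_to_quant_bands(
    pattern_b: Sequence[float],
    critical_bands: Sequence[Band],
    quant_bands: Sequence[Band],
) -> List[float]:
    """Convert a per-critical-band pattern into a per-quantization-band pattern.

    Each critical band contributes to a quantization band in proportion
    to the share of the critical band's width that the two bands overlap.
    """
    pattern_d = []
    for q_lo, q_hi in quant_bands:
        total = 0.0
        for value, (c_lo, c_hi) in zip(pattern_b, critical_bands):
            overlap = min(q_hi, c_hi) - max(q_lo, c_lo)
            if overlap > 0:
                total += value * overlap / (c_hi - c_lo)
        pattern_d.append(total)
    return pattern_d
```

For instance, with critical bands `[(0, 100), (100, 300)]` holding values `[10.0, 40.0]` and quantization bands `[(0, 200), (200, 300)]`, the result is `[30.0, 20.0]`: the second critical band is split evenly between the two quantization bands. Under this reading the total over all bands is preserved, which suits energy-like patterns; a weight array might instead call for an overlap-weighted average.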

Any weight arrays with weights for critical bands (e.g., W[b]) are converted to
weight arrays with weights for quantization bands (e.g., W[d]) according to proportion
of band spectrum overlap, or some other technique. Certain weight arrays (e.g., G[d],
Z[d]) may start in terms of quantization bands, in which case conversion is not
required. The weight arrays can vary in terms of amplitudes or number of quantization
bands within an encoding session.

The encoder then computes (791) the band-weighted NER as a summation over the
quantization bands, for example using an equation given above for calculating NER
for critical bands, but replacing the indices b with d.

Having described and illustrated the principles of our invention with reference to
an illustrative embodiment, it will be recognized that the illustrative embodiment can be
modified in arrangement and detail without departing from such principles. It should be
understood that the programs, processes, or methods described herein are not related
or limited to any particular type of computing environment, unless indicated otherwise.
Various types of general purpose or specialized computing environments may be used
with or perform operations in accordance with the teachings described herein.
Elements of the illustrative embodiment shown in software may be implemented in
hardware and vice versa.

In view of the many possible embodiments to which the principles of our 
invention may be applied, we claim as our invention all such embodiments as may 
come within the scope and spirit of the following claims and equivalents thereto. 






